peer-to-peer epoll clients and deadlock - networking

Suppose a peer-to-peer program uses epoll to perform asynchronous TCP reads from and writes to multiple peers. Naturally, this means that every file descriptor is set to nonblocking to allow epoll_wait to be called and for multiple sockets to be checked.
However, there is a potential issue. Suppose there are two peers: A and B. A tries to write a message to B, but B is congested or otherwise not reading, so the call to write returns -1 with errno set to EAGAIN. At this point, A goes to sleep on the call to epoll_wait.
But note that B is already stuck on its own call to epoll_wait. If B is never notified about A's failed attempt to send it a message, then B will never wake up and try to perform a read on A's socket, and the entire thing will deadlock. So my question is: is B guaranteed to be notified that A is attempting to send it a message, even if A gives up on the original write call and goes to sleep?
Even if the answer to the above is "yes", is it possible for a system like this to deadlock indefinitely because of application-layer desynchronization? i.e. A tries to write to B but fails, so it goes to sleep. Then B wakes up and tries to read from A, but fails because A went to sleep. etc.

Any protocol that had a possible state where both sides are permitted to wait for the other side to read before they read would be a fundamentally broken protocol. For peer-to-peer protocols, typically each end is prohibited from delaying reads just because it cannot write.
On the implementation side, typically every call to epoll_wait (or the equivalent way you discover ready I/O) checks for input on all descriptors the program is using. Reading is never deferred unless the application has unprocessed data that it has already read and it stops deferring as soon as that data is processed. Waiting for network activity before reading is generally a very bad idea.
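As a rough illustration of that rule, here is a minimal Linux sketch, not a complete event loop. The names `struct peer`, `set_nonblocking`, `peer_flush`, `peer_events` and `peer_rearm` are hypothetical helpers invented for this example. The point it tries to show is that EPOLLIN stays in the interest set at all times, while EPOLLOUT is added only while a write has hit EAGAIN and output is still queued.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical per-peer state: output queued but not yet written. */
struct peer {
    int fd;            /* nonblocking socket */
    const char *out;   /* pending output, owned by the caller */
    size_t out_len;    /* bytes still to write */
};

/* Put a descriptor into nonblocking mode. Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Try to write pending output. Returns 1 if nothing is left to write,
 * 0 if the socket would block (wait for EPOLLOUT), -1 on any other error. */
int peer_flush(struct peer *p)
{
    assert(p != NULL);
    while (p->out_len > 0) {
        ssize_t n = write(p->fd, p->out, p->out_len);
        if (n > 0) {
            p->out += n;
            p->out_len -= (size_t)n;
        } else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            return 0;   /* kernel send buffer is full; keep the rest queued */
        } else if (n < 0 && errno == EINTR) {
            continue;   /* interrupted by a signal; retry */
        } else {
            return -1;
        }
    }
    return 1;
}

/* Interest set for a peer: always EPOLLIN, plus EPOLLOUT while output is
 * pending. Reading is never switched off just because writing is blocked. */
uint32_t peer_events(const struct peer *p)
{
    uint32_t ev = EPOLLIN;
    if (p->out_len > 0)
        ev |= EPOLLOUT;
    return ev;
}

/* Update the peer's registration on an existing epoll instance. */
int peer_rearm(int epfd, struct peer *p)
{
    struct epoll_event ev = { .events = peer_events(p), .data.ptr = p };
    return epoll_ctl(epfd, EPOLL_CTL_MOD, p->fd, &ev);
}
```

With this shape, a peer whose write returned EAGAIN still wakes up from epoll_wait when the other side sends data, so both sides keep draining their input and the send buffers eventually empty.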
This is why typical protocol-neutral TCP proxies use two processes or two threads. You can't just read from A and then go do a blocking write to B because you don't know if B is required to read before it writes.
This is also why calling recv with MSG_WAITALL is almost always a bad idea. The other end might be waiting for you to receive the bytes it has already sent before it sends any more. No protocol can allow one side to wait for all the bytes to be sent before reading any of them if it also allows the other side to wait until some bytes have been read before sending the rest of them!

Related

MPI standard 3: when synchronous send is complete?

In the MPI Standard, Section 3.4 (page 37): http://mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf
the synchronous send completion means
1. the send-buffer can be reused
2. the receiver has started to receive data.
The standard says "has started" instead of "has completed", so I have a question about this: Imagine a case:
The sender calls MPI_Ssend, then a receiver is matched and has started to receive data. At this time, the send is complete and returned. As the MPI standard said, the send-buffer can be reused, so the sender modifies some data of the send-buffer. At the same time, the receiver is receiving data very slowly (e.g. network is very bad), so how can we guarantee the data finally received by the receiver is same as the original data stored in sender's send-buffer?
Ssend is synchronous: it cannot return before the corresponding Recv has been called.
Ssend is blocking: the function returns only when it is safe to touch the send-buffer.
Synchronous and blocking are two different things; I know it can be confusing.
Most implementations of Send work as follows (MPICH, OpenMPI, CRAY-MPI):
For small messages, the send-buffer is copied into memory that is reserved for MPI. As soon as the copy is done, Send returns.
For large messages, no copy is made, so Send returns only once the entire send-buffer has been sent to the network (which cannot happen before the Recv has been called, to avoid overloading the network memory).
So an MPI_Send is blocking, asynchronous for small messages, and synchronous for large ones.
Ssend works as follows:
As soon as the Recv has started AND the send-buffer is either copied or fully in the network, Ssend returns. In both cases the data no longer depends on the send-buffer once Ssend has returned, so modifying the buffer afterwards cannot change what the receiver gets.
Ssend should be avoided as much as possible, because it slows down communication: the network has to tell the sender that the Recv has started.

TCP client-server SIGPIPE

I am designing and testing a client-server program based on TCP sockets (Internet domain). Currently, I am testing it on my local machine and I am not able to understand the following about SIGPIPE.
*. SIGPIPE appears quite randomly. Can it be deterministic?
The first tests involved a single small (25 characters) send operation from the client and a corresponding receive at the server. The same code, on the same machine, either runs successfully or fails with SIGPIPE, totally out of my control. The failure rate is about 45% (quite high). Can I tune the machine in any way to minimize this?
**. The second round of testing was to send 40000 small (25 characters) messages from the client to the server (1MB of total data), with the server then responding with the total size of the data it actually received. The client sends data in a tight loop and there is a SINGLE receive call at the server. It works only for a maximum of 1200 bytes of total data sent and, again, there are these non-deterministic SIGPIPEs, now about 70% of the time (really bad).
Can someone suggest some improvement to my design (probably it will be at the server)? The requirement is that the client shall be able to send a medium to very high amount of data (again about 25 characters per message) after a single socket connection has been made to the server.
I have a feeling that multiple sends against a single receive will always be lossy and very inefficient. Should we be combining the messages and sending them in one send() operation only? Is that the only way to go?
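SIGPIPE is sent when you try to write to a pipe or socket whose other end is no longer connected. Ignoring the signal (or installing a handler for it) makes send() return an error (-1 with errno set to EPIPE) instead of terminating the process.
signal(SIGPIPE, SIG_IGN);
Alternatively, on systems that provide SO_NOSIGPIPE (such as macOS and FreeBSD; Linux does not have it), you can disable SIGPIPE for a single socket: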
int n = 1;
setsockopt(thesocket, SOL_SOCKET, SO_NOSIGPIPE, &n, sizeof(n));
Also, the data amounts you're mentioning are not very high. Likely there's a bug somewhere that causes your connection to close unexpectedly, giving a SIGPIPE.
SIGPIPE is raised because you are attempting to write to a socket that has been closed. This does indicate a probable bug so check your application as to why it is occurring and attempt to fix that first.
Attempting to just mask SIGPIPE is not a good idea because you don't really know where the signal is coming from and you may mask other sources of this error. In multi-threaded environments, signals are a horrible solution.
In the rare cases where you cannot avoid this, you can mask the signal on send. If you set the MSG_NOSIGNAL flag on send()/sendto(), it will prevent SIGPIPE from being raised. If you do trigger this error, send() returns -1 and errno will be set to EPIPE. Clean and easy. See man send for details.
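A minimal sketch of the MSG_NOSIGNAL approach follows. It assumes a blocking socket on a system that supports MSG_NOSIGNAL, such as Linux. The helper name `send_all_nosigpipe` is made up for this example. It loops because send() on a stream socket may accept fewer bytes than requested.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Send len bytes on a blocking socket without ever raising SIGPIPE.
 * Returns 0 when everything was handed to the kernel, or -1 with errno set
 * (EPIPE if the peer has closed the connection). */
int send_all_nosigpipe(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    assert(buf != NULL || len == 0);
    while (len > 0) {
        ssize_t n = send(fd, p, len, MSG_NOSIGNAL);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal; retry */
            return -1;          /* real error, e.g. EPIPE or ECONNRESET */
        }
        p += n;                 /* partial send: advance and keep going */
        len -= (size_t)n;
    }
    return 0;
}
```

A caller that sees -1 with errno set to EPIPE knows exactly which socket failed, which a process-wide signal handler cannot tell it.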

Does asynchronous receive guarantee the detection of connection failure?

From what I know, a blocking receive on a TCP socket does not always detect a connection error (due either to a network failure or to a remote-endpoint failure) by returning a -1 value or raising an IO exception: sometimes it could just hang indefinitely.
One way to manage this problem is to set a timeout for the blocking receive. In case an upper bound for the reception time is known, this bound could be set as timeout and the connection could be considered lost simply when the timeout expires; when such an upper bound is not known a priori, for example in a pub-sub system where a connection stays open to receive publications, the timeout to be set would be somewhat arbitrary but its expiration could trigger a ping/pong request to verify that the connection (and the endpoint too) is still up.
I wonder whether the use of asynchronous receive also manages the problem of detecting a connection failure. In boost::asio I would call socket::async_read_some(), registering a handler to be called asynchronously, while in java.nio I would configure the channel as non-blocking and register it with a selector with an OP_READ interest flag. I imagine that correct connection-failure detection would mean that, in the first case, the handler would be called with a non-0 error_code, while in the second case the selector would select the faulty channel but a subsequent read() on the channel would either return -1 or throw an IOException.
Is this behaviour guaranteed with asynchronous receive, or could there be scenarios where after a connection failure, for example, in boost::asio the handler will never be called or in java.nio the selector will never select the channel?
Thank you very much.
I believe you're referring to the TCP half-open connection problem (in the RFC 793 sense of the term). In this scenario, the receiving OS never receives any indication of the lost connection, so it never notifies the app. Whether the app is reading synchronously or asynchronously doesn't enter into it.
The problem occurs when the transmitting side of the connection is somehow no longer aware of the network connection. This can happen, for example, when:
1. the transmitting OS abruptly terminates or restarts (power outage, OS failure/BSOD, etc.);
2. the transmitting side closes and cleans up its side while there is a network disruption between the two sides, e.g. the transmitting OS reboots cleanly during the disruption, or a transmitting Windows machine is unplugged from the network.
When this happens, the receiving side may be waiting for data or a FIN that will never come. Unless the receiving side sends a message, there's no way for it to realize the transmitting side is no longer aware of the receiving side.
Your solution (a timeout) is one way to address the issue, but it should include sending a message to the transmitting side. Again, it doesn't matter whether the read is synchronous or asynchronous, just that it doesn't read and wait indefinitely for data or a FIN. Another solution is to use the TCP keep-alive feature that is supported by some TCP stacks. But the hard part of any generalized solution is usually determining a proper timeout, since the timeout is highly dependent on the characteristics of the specific application.
Because of how TCP works, you will typically have to send data in order to notice a hard connection failure, to find out that no ACK packet will ever be returned. Some protocols attempt to identify conditions like this by periodically using a keep-alive or ping packet: if one side does not receive such a packet in X time (and perhaps after trying and failing one itself), it can consider the connection dead.
To answer your question, blocking and non-blocking receive should perform identically except for the act of blocking itself, so both will suffer from this same issue. In order to make sure that you can detect a silent failure from the remote host, you'll have to use a form of keep-alive like I described.
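For the TCP keep-alive option mentioned above, a minimal sketch on Linux might look like the following. The option names TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT are Linux-specific, the helper name `enable_keepalive` is made up for this example, and the timing values are left to the caller because they depend entirely on the application.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Turn on TCP keep-alive probes for a socket (Linux option names).
 *   idle_s     - seconds of silence before the first probe
 *   interval_s - seconds between unanswered probes
 *   probes     - unanswered probes before the connection is reported dead
 * Returns 0 on success, -1 on error with errno set by setsockopt(). */
int enable_keepalive(int fd, int idle_s, int interval_s, int probes)
{
    int on = 1;
    assert(idle_s > 0 && interval_s > 0 && probes > 0);
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof idle_s) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval_s, sizeof interval_s) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof probes) < 0)
        return -1;
    return 0;
}
```

Once the probes go unanswered, a pending blocking read fails and a socket registered with a selector or epoll becomes readable with an error, so this works the same way for synchronous and asynchronous receivers. An application-level ping/pong is still needed if the goal is to detect a hung remote application rather than a dead host or network path.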

How do I handle partial write completions from overlapped I/O using I/O Completion Ports

On Windows I/O completion ports, say I do this:
void function()
{
WSASend("1111"); // A
WSASend("2222"); // B
WSASend("3333"); // C
}
If I got a "write-complete" that says 3 bytes of WSASend() A were sent, is it possible that right after that I'll get a "write-complete" that tells me that some or all of B & C were sent, or will TCP hold them until I re-issue a WSASend() call with the rest of A's data? Or will TCP complete it automatically?
I've been developing client and server systems with IOCP for over 10 years now and I've never seen a partial write completion in normal usage. You CAN get them, but if you do then the chances are your server is hosed anyway; you'll likely get a write completion with an error of ENOBUFS (WSAENOBUFS in Winsock), which tends to mean that you've exhausted non-paged pool or exceeded the locked pages limit.
IMHO you should manage your resources such that you never hit these operating system limits.
If a write completion returns fewer bytes than you think you should have written, then there's not really much that you can do to recover if you have more writes pending. In the example in your question, if A failed then you could only really shut down the connection, because B and C might still succeed and leave a gap in the stream. In practice you don't have to worry about more than this.
If you know that you don't have any more writes pending on that connection then you COULD issue another write to write the subsequent data.
And TCP doesn't come into this at all.

unix network process

I was wondering how TCP/IP communication is implemented in Unix. When you do a send over the socket, does the TCP-level work (assembling packets, computing checksums, etc.) get executed in the same execution context as the calling code?
Or, what seems more likely, a message is sent to some other daemon process responsible for tcp communication? This process then takes the message and performs the requested work of copying memory buffers and assembling packets etc.? So, the calling code resumes execution right away and tcp work is done in parallel? Is this correct?
Details would be appreciated. Thanks!
The TCP/IP stack is part of your kernel. What happens is that you call a helper method which prepares a "kernel trap". This is a special kind of exception which puts the CPU into a mode with more privileges ("kernel mode"). Inside of the trap, the kernel examines the parameters of the exception. One of them is the number of the function to call.
When the function is called, it copies the data into a kernel buffer and prepares everything for the data to be processed. Then it returns from the trap, the CPU restores registers and its original mode and execution of your code resumes.
Some kernel thread will pick up the copy of the data and use the network driver to send it out, do all the error handling, etc.
So, yes, after copying the necessary data, your code resumes and the actual data transfer happens in parallel.
Note that this is for TCP. The TCP protocol does all the error handling and handshaking for you, so you can give it all the data and it will know what to do. If there is a problem with the connection, you'll notice only after a while, since TCP can ride out short network outages by itself. That means you'll have "sent" some data already before you get an error: the error for the first packet typically shows up only on some later call to send(). By default close() returns immediately and the kernel keeps trying to deliver the queued data in the background; close() blocks only if SO_LINGER is set with a nonzero timeout.
UDP does not retransmit or wait for acknowledgements. When the call returns, the datagram has been handed to the kernel for transmission, but it's "fire and forget": you know at most that it was queued for sending. If you want to know whether it has arrived somewhere, you must figure out a way to achieve that yourself. The usual approach is to have the receiver send an ack UDP packet back (which might also get lost).
No - there is no parallel execution. It is true that the execution context when you're making a system call is not the same as your usual execution context. When you make a system call, such as for sending a packet over the network, you must switch into the kernel's context - the kernel's own memory map and stack, instead of the virtual memory you get inside your process.
But there are no daemon processes magically dispatching your call. The rest of the execution of your program has to wait for the system call to finish and return whatever values it will return. This is why you can count on return values being available right away when you return from the system call - values like the number of bytes actually read from the socket or written to a file.
I tried to find a nice explanation for how the context switch to kernel space works. Here's a nice in-depth one that even focuses on architecture-specific implementation:
http://www.ibm.com/developerworks/linux/library/l-system-calls/
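The part both answers agree on can be seen with a small experiment: send() returns its byte count as soon as the kernel has accepted the data, before the other end has read anything. The sketch below uses a local socketpair as a stand-in for a TCP connection, which is an assumption made to keep it self-contained, and the helper name `send_before_peer_reads` is invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a connected pair of stream sockets in sv[0] and sv[1], then send
 * len bytes of msg on sv[0] while nobody is reading from sv[1].
 * Returns whatever send() returned: the number of bytes the kernel copied
 * into its buffer, or -1 on error. */
ssize_t send_before_peer_reads(int sv[2], const char *msg, size_t len)
{
    assert(msg != NULL);
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    /* The system call runs in the caller's thread and returns once the
     * data sits in a kernel buffer; delivery to the peer happens later. */
    return send(sv[0], msg, len, 0);
}
```

The return value only says how many bytes the kernel took. Whether and when the peer actually reads them is a separate question that the application protocol has to answer.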

Resources