Question about persistent communication requests in MPI

I have come across the following passage from the document "MPI: A Message-Passing Interface Standard Version 4.0":
A persistent communication request is deallocated by a call to MPI_REQUEST_FREE. The call to MPI_REQUEST_FREE can occur at any point in the program
after the persistent request was created. However, the request will be deallocated only after it becomes inactive. Active receive requests should not be freed. Otherwise, it will not be possible to check that the receive has completed.
I'm trying to understand how both these sentences can be correct:
"the request will be deallocated only after it becomes inactive"
"Active receive requests should not be freed"
The first sentence seems to imply that only inactive requests can be freed. But the second one appears to be saying that active (receive) requests can be freed but should not be freed.

If you do MPI_Request_free on an active request, your handle becomes null. But as point 2 notes, that is not a good idea because you can no longer check on the status of the communication.
However, by freeing you only lose the handle: in the case of a still-ongoing communication, the actual request object still exists, taking up memory. That one will only be deallocated -- meaning, by MPI, not by you -- when the communication is finished.
Put another way, MPI_Request_free does deallocate the request object, but immediately only if the request is inactive; otherwise the deallocation happens once the request becomes inactive.
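To make the lifecycle concrete, here is a minimal sketch in C (assuming a standard MPI installation and at least two ranks; the count and tag are arbitrary) in which the persistent request is freed only while it is inactive, so the handle stays usable for completion checks until then:

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, buf = 0;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Create a persistent send request: it starts out inactive. */
        MPI_Send_init(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        for (int i = 0; i < 3; i++) {
            buf = i;
            MPI_Start(&req);                    /* request becomes active */
            MPI_Wait(&req, MPI_STATUS_IGNORE);  /* ...and inactive again  */
        }
        /* Safe: the request is inactive, so the object is deallocated
           immediately and req is set to MPI_REQUEST_NULL. Calling this
           between MPI_Start and MPI_Wait would only null our handle and
           leave us unable to check for completion. */
        MPI_Request_free(&req);
    } else if (rank == 1) {
        int recv_buf;
        for (int i = 0; i < 3; i++)
            MPI_Recv(&recv_buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}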

Related

what does `TCPBacklogDrop` mean when using `netstat -s`

Recently I have been debugging a problem on a Unix system using the command
netstat -s
and I got output containing
$ netstat -s
// other fields
// other fields
TCPBacklogDrop: 368504
// other fields
// other fields
I have searched for a while to understand what this field means, and found mainly two different answers:
1. This means that your TCP data receive buffer is full, and some packets have overflowed it.
2. This means your TCP accept queue is full, and there have been some disconnections.
Which is the correct one? Is there any official document to support it?
Interpretation #2 is referring to the queue of sockets waiting to be accepted, possibly because its size is set (more or less) by the value of the parameter named backlog to listen. This interpretation, however, is not correct.
To understand why interpretation #1 is correct (although incomplete), we will need to consult the source. First note that the string "TCPBacklogDrop" is associated with the Linux identifier LINUX_MIB_TCPBACKLOGDROP (see, e.g., this). This is incremented here in tcp_add_backlog.
Roughly speaking, there are 3 queues associated with the receive side of an established TCP socket. If the application is blocked on a read when a packet arrives, it will generally be sent to the prequeue for processing in user space in the application process. If it can't be put on the prequeue, and the socket is not locked, it will be placed in the receive queue. However, if the socket is locked, it will be placed in the backlog queue for subsequent processing.
If you follow through the code you will see that the call to sk_add_backlog called from tcp_add_backlog will return -ENOBUFS if the receive queue is full (including that which is in the backlog queue) and the packet will be dropped and the counter incremented. I say this interpretation is incomplete because this is not the only place where a packet could be dropped when the "receive queue" is full (which we now understand to be not as straightforward as a single queue).
I wouldn't expect such drops to be frequent and/or problematic under normal operating conditions, as the sender's TCP stack should honor the advertised window of the receiver and not send data exceeding the capacity of the receive queue (with the exception of zero window probes and older kernel versions whose calculations could cause drops when the receive window was not actually full). If it is somehow indicative of a problem, I would start worrying about malicious clients (some form of DDOS maybe) or some failure causing a socket's lock to be held for an extended period of time.
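If you want to watch this counter programmatically rather than scraping netstat output, the same value is exported on the TcpExt: lines of /proc/net/netstat. A rough sketch in C, assuming the usual Linux layout of a header line of names followed by a line of values (the counter name to look up is passed on the command line):

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>

/* Pair the "TcpExt:" header line of /proc/net/netstat with its value
   line and print the requested counter (default: TCPBacklogDrop). */
int main(int argc, char **argv)
{
    const char *wanted = (argc > 1) ? argv[1] : "TCPBacklogDrop";
    char names[4096], values[4096];
    FILE *f = fopen("/proc/net/netstat", "r");
    if (!f) { perror("fopen"); return 1; }

    while (fgets(names, sizeof names, f)) {
        if (strncmp(names, "TcpExt:", 7) != 0)
            continue;
        if (!fgets(values, sizeof values, f))   /* matching value line */
            break;
        char *nsave, *vsave;
        char *n = strtok_r(names + 7, " \n", &nsave);
        char *v = strtok_r(values + 7, " \n", &vsave);
        while (n && v) {
            if (strcmp(n, wanted) == 0) {
                printf("%s = %s\n", n, v);
                fclose(f);
                return 0;
            }
            n = strtok_r(NULL, " \n", &nsave);
            v = strtok_r(NULL, " \n", &vsave);
        }
    }
    fclose(f);
    fprintf(stderr, "%s not found\n", wanted);
    return 1;
}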

MPI standard 3: when synchronous send is complete?

In the MPI Standard, Section 3.4 (page 37): http://mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf
the completion of a synchronous send means that
1. the send buffer can be reused, and
2. the receiver has started to receive the data.
The standard says "has started" instead of "has completed", so I have a question about this: Imagine a case:
The sender calls MPI_Ssend, then a receiver is matched and has started to receive data. At this point the send completes and returns. As the MPI standard says, the send buffer can now be reused, so the sender modifies some data in the send buffer. At the same time, the receiver is receiving data very slowly (e.g. the network is very bad), so how can we guarantee that the data finally received by the receiver is the same as the data originally stored in the sender's send buffer?
Ssend is synchronous. It means that Ssend cannot return before the matching Recv has been posted.
Ssend is blocking. It means that the function returns only when it is safe to touch the send buffer.
Synchronous and blocking are two different things; I know it can be confusing.
Most implementations of Send (MPICH, Open MPI, Cray MPI) work as follows:
For small messages, the send buffer is copied into memory reserved by MPI. As soon as the copy is done, the send returns.
For large messages, no copy is made, so the send returns only once the entire send buffer has been handed to the network (which cannot happen before the matching Recv has been posted, to avoid overloading the network memory).
So MPI_Send is blocking: effectively asynchronous for small messages, synchronous for large ones.
An Ssend works as follows:
As soon as the Recv has started AND the send buffer has either been copied or is fully on the network, the Ssend returns.
Ssend should be avoided as much as you can, as it slows down the communication (the network needs to tell the sender that the receive has started).
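To make the guarantee from the question concrete: by the time MPI_Ssend returns, the implementation has either copied the data or already pushed it onto the network, so modifying the send buffer afterwards cannot change what the receiver gets. A minimal sketch (assuming at least two ranks; tag and values are arbitrary):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, data = 42;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Returns only after the matching receive has started AND the
           buffer contents are out of our hands (copied or on the wire). */
        MPI_Ssend(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        data = -1;   /* safe: cannot affect what rank 1 receives */
    } else if (rank == 1) {
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", data);   /* always prints 42 */
    }

    MPI_Finalize();
    return 0;
}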

peer-to-peer epoll clients and deadlock

Suppose a peer-to-peer program uses epoll to perform asynchronous TCP reads from and writes to multiple peers. Naturally, this means that every file descriptor is set to nonblocking to allow epoll_wait to be called and for multiple sockets to be checked.
However, there is a potential issue. Suppose there are two peers: A and B. A tries to write a message to B, but B is congested or something and so the call to write returns -1 with errno set to EAGAIN. At this point, A goes to sleep on the call to epoll_wait.
But note that B is already stuck on its own call to epoll_wait. If B is never notified about A's failed attempt to send it a message, then B will never wake up and try to perform a read on A's socket, and the entire thing will deadlock. So my question is, is B guaranteed to be notified that A is attempting to send it a message, even if A gives up on the original write call and goes to sleep?
Even if the answer to the above is "yes", is it possible for a system like this to deadlock indefinitely because of application-layer desynchronization? i.e. A tries to write to B but fails, so it goes to sleep. Then B wakes up and tries to read from A, but fails because A went to sleep. etc.
Any protocol that had a possible state where both sides are permitted to wait for the other side to read before they read would be a fundamentally broken protocol. For peer-to-peer protocols, typically each end is prohibited from delaying reads just because it cannot write.
On the implementation side, typically every call to epoll_wait (or the equivalent way you discover ready I/O) checks for input on all descriptors the program is using. Reading is deferred only while the application is still holding unprocessed data it has already read, and the deferral ends as soon as that data is processed (a sketch of this pattern follows at the end of this answer). Waiting for network activity before reading is generally a very bad idea.
This is why typical protocol-neutral TCP proxies use two processes or two threads. You can't just read from A and then go do a blocking write to B because you don't know if B is required to read before it writes.
This is also why calling recv with MSG_WAITALL is almost always a bad idea. The other end might be waiting for you to receive the bytes it has already sent before it sends any more. No protocol can allow one side to wait for all the bytes to be sent before reading any of them if it also allows the other side to wait until some bytes have been read before sending the rest of them!
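Here is a hypothetical sketch of that pattern in plain C with epoll (the struct peer, handle_readable and flush_pending pieces are placeholders, not anything from the question): every descriptor is always registered for EPOLLIN, and EPOLLOUT is armed only while there is unsent data, so reads are never skipped just because a write would have blocked.

#include <sys/epoll.h>
#include <unistd.h>
#include <errno.h>

/* Hypothetical per-peer state, for illustration only. */
struct peer {
    int fd;                   /* nonblocking TCP socket           */
    int has_pending_output;   /* set when a write returned EAGAIN */
};

/* Drain whatever is readable right now; a real program would parse it. */
static void handle_readable(struct peer *p)
{
    char buf[4096];
    while (read(p->fd, buf, sizeof buf) > 0)
        ;   /* process the input here */
    /* read() returning -1 with errno EAGAIN means "no more for now". */
}

/* Retry the queued writes; on success the flag is cleared (details elided). */
static void flush_pending(struct peer *p)
{
    p->has_pending_output = 0;
}

/* Always interested in reads; in writes only while data is queued. */
static void update_interest(int epfd, struct peer *p)
{
    struct epoll_event ev = {0};
    ev.events = EPOLLIN | (p->has_pending_output ? EPOLLOUT : 0);
    ev.data.ptr = p;
    epoll_ctl(epfd, EPOLL_CTL_MOD, p->fd, &ev);
}

/* Core loop: reads are never deferred because of a blocked write. */
void event_loop(int epfd)
{
    struct epoll_event events[64];
    for (;;) {
        int n = epoll_wait(epfd, events, 64, -1);
        if (n < 0 && errno == EINTR)
            continue;
        for (int i = 0; i < n; i++) {
            struct peer *p = events[i].data.ptr;
            if (events[i].events & EPOLLIN)
                handle_readable(p);
            if (events[i].events & EPOLLOUT)
                flush_pending(p);
            update_interest(epfd, p);
        }
    }
}

As long as each side keeps reading like this, TCP's flow control drains the buffers: B will see EPOLLIN for the data A has already queued, and once B reads it, A's previously blocked write becomes possible again and EPOLLOUT fires on A's side.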

BizTalk Zombies - any way to explicitly REMOVE a subscription from within a BizTalk orchestration

Background:
We make use of a lot of aggregation, singleton and multiton orchestrations, similar to Seroter's Round Robin technique described here (BizTalk 2009).
All of these orchestration types have fairly arbitrary exit or continuation points (for aggregations), usually defined by a timer - i.e. if an Orch hasn't received any more messages within X minutes then proceed with the batching, and if Y more minutes elapse with no more messages then quit. (We also exit our Single / N-Tons due to concerns about degraded performance after large numbers of messages have been subscribed to the singleton over a period.)
As much as we've tried to mitigate against Zombies, e.g. by Starting any continuation processing in an asynchronously refactored orchestration, there is always a point of weakness where a well-timed message could cause a zombie (i.e. more incoming messages arriving that are correlated to the 'already completed' shapes of an orchestration).
If a message causes a zombie on one of the subscriptions, the message does not appear to be propagated to OTHER subscribers either (i.e. orchs totally decoupled from the 'zombie-causing' orchestration); in other words, the zombie-causing message is not processed at all.
Question
So I would be very interested in seeing if anyone has another way, programmatically or otherwise, to explicitly remove a correlated subscription from a running orchestration once the orchestration has 'progressed' beyond the point where it is interested in this correlated message. (Such a new message would then typically start a new orchestration with its own correlations, etc.)
At this point we would consider even a hack solution such as a reflected BizTalk API call or direct SQL delete against the MsgBoxDB.
No, you can't explicitly remove the subscription in an Orchestration.
The subscription will be removed as the Orchestration is tearing itself down, but a message arriving at that exact instant will still be routed to the Orchestration; the Orchestration will end without processing it, and that's your Zombie.
Microsoft Article about Zombies http://msdn.microsoft.com/en-us/library/bb203853.aspx
I once also had to implement a receive, debatch, aggregate, send pattern: receiving enveloped messages from multiple senders, debatching them, and aggregating by intended recipient (based on two rules, number of messages or time delay, whichever occurred first).
This scenario was ripe for Zombies, and when I read about them I designed the solution so that they would not occur. This was for BizTalk 2004.
I debatched the messages and inserted them into a database. A stored procedure, polled by a receive port, would work out whether there was a batch to send; if there was, it would trigger an Orchestration that would take that message and route it dynamically.
Since neither Orchestration had to wait for another message, they could end gracefully and there were no Zombies.

Does asynchronous receive guarantee the detection of connection failure?

From what I know, a blocking receive on a TCP socket does not always detect a connection error (due either to a network failure or to a remote-endpoint failure) by returning a -1 value or raising an IO exception: sometimes it could just hang indefinitely.
One way to manage this problem is to set a timeout for the blocking receive. In case an upper bound for the reception time is known, this bound could be set as timeout and the connection could be considered lost simply when the timeout expires; when such an upper bound is not known a priori, for example in a pub-sub system where a connection stays open to receive publications, the timeout to be set would be somewhat arbitrary but its expiration could trigger a ping/pong request to verify that the connection (and the endpoint too) is still up.
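In plain C socket terms (just for illustration; the question itself is about boost::asio and java.nio), the timeout half of that approach might look like the sketch below, where the timeout value is an arbitrary, application-specific choice:

#include <sys/socket.h>
#include <sys/time.h>

/* Hypothetical sketch: make a blocking recv() give up after `seconds`
   so the application can send a ping/pong probe instead of hanging
   forever. After this, recv() returns -1 with errno set to
   EAGAIN/EWOULDBLOCK when the timeout expires. */
static int set_recv_timeout(int sockfd, long seconds)
{
    struct timeval tv = { .tv_sec = seconds, .tv_usec = 0 };
    return setsockopt(sockfd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv);
}

/* e.g. set_recv_timeout(sockfd, 5); then treat an expired timeout as a
   cue to probe the peer, not necessarily as a dead connection. */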
I wonder whether the use of asynchronous receive also manages the problem of detecting a connection failure. In boost::asio I would call socket::async_read_some() registering a handler to be asynchronously called, while in java.nio I would configure the channel as non-blocking and register it to a selector with an OP_READ interest flag. I imagine that correct connection-failure detection would mean that in the first case the handler would be called with a non-zero error_code, while in the second case the selector would select the faulty channel but a subsequent read() on the channel would either return -1 or throw an IOException.
Is this behaviour guaranteed with asynchronous receive, or could there be scenarios where after a connection failure, for example, in boost::asio the handler will never be called or in java.nio the selector will never select the channel?
Thank you very much.
I believe you're referring to the TCP half-open connection problem (the RFC 793 meaning of the term). In this scenario, the receiving OS will never receive an indication of the lost connection, so it will never notify the app. Whether the app is reading synchronously or asynchronously doesn't enter into it.
The problem occurs when the transmitting side of the connection somehow is no longer aware of the network connection. This can happen, for example, when
the transmitting OS abruptly terminates/restarts (power outage, OS failure/BSOD, etc.).
the transmitting side closes the connection while there is a network disruption between the two sides and cleans up its end: e.g. the transmitting OS reboots cleanly during the disruption, or a transmitting Windows machine is unplugged from the network.
When this happens, the receiving side may be waiting for data or a FIN that will never come. Unless the receiving side sends a message, there's no way for it to realize the transmitting side is no longer aware of the receiving side.
Your solution (a timeout) is one way to address the issue, but it should include sending a message to the transmitting side. Again, it doesn't matter whether the read is synchronous or asynchronous, just that it doesn't read and wait indefinitely for data or a FIN. Another solution is using the TCP KEEPALIVE feature that is supported by some TCP stacks. But the hard part of any generalized solution is usually determining a proper timeout, since the timeout is highly dependent on the characteristics of the specific application.
Because of how TCP works, you will typically have to send data in order to notice a hard connection failure, i.e. to find out that no ACK packet will ever be returned. Some protocols attempt to identify conditions like this by periodically using a keep-alive or ping packet: if one side does not receive such a packet within X time (and perhaps after trying to send one itself and failing), it can consider the connection dead.
To answer your question, blocking and non-blocking receive should perform identically except for the act of blocking itself, so both will suffer from this same issue. In order to make sure that you can detect a silent failure from the remote host, you'll have to use a form of keep-alive like I described.
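On stacks that support it, the keep-alive approach both answers mention can be pushed down into TCP itself. A hypothetical Linux-flavoured sketch in C (the idle/interval/count values are illustrative only):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Sketch: enable TCP keepalive so the stack sends probes on an idle
   connection and eventually reports the dead peer via an error
   (typically ETIMEDOUT) on a later read. These options are Linux-specific. */
static int enable_keepalive(int sockfd)
{
    int on = 1, idle = 60, interval = 10, count = 5;

    if (setsockopt(sockfd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) < 0)
        return -1;
    /* Start probing after 60 s of idleness ... */
    if (setsockopt(sockfd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof idle) < 0)
        return -1;
    /* ... probe every 10 s ... */
    if (setsockopt(sockfd, IPPROTO_TCP, TCP_KEEPINTVL, &interval, sizeof interval) < 0)
        return -1;
    /* ... and give up after 5 missed probes. */
    return setsockopt(sockfd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof count);
}

An asynchronous read (an asio handler or a selected channel) will then surface the failure the same way it surfaces any other socket error, which is the behaviour the question asks about; application-level pings remain the more portable option.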
