MPI peer-to-peer communications: order of messages

I've read that MPI preserves message order, but the exact meaning of this is not clear to me.
Suppose that process A sends two messages (message 1 and then message 2) with different tags to process B in buffered mode. Process B posts two matching receive calls in the opposite order. Is it possible for message 2 to be received before message 1?


What does `TCPBacklogDrop` mean when using `netstat -s`?

I have recently been debugging a problem on a Unix system using the command
netstat -s
and I get an output with
$ netstat -s
// other fields
// other fields
TCPBacklogDrop: 368504
// other fields
// other fields
I have searched for a while to understand what this field means, and found mainly two different answers:
1. It means that the TCP data receive buffer is full and some packets are being dropped on overflow.
2. It means that the TCP accept queue is full and some connections are being dropped.
Which one is correct? Is there any official documentation to support it?
Interpretation #2 refers to the queue of sockets waiting to be accepted, presumably because the size of that queue is set (more or less) by the backlog parameter of listen. This interpretation, however, is not correct.
To understand why interpretation #1 is correct (although incomplete), we will need to consult the source. First note that the string "TCPBacklogDrop" is associated with the Linux identifier LINUX_MIB_TCPBACKLOGDROP. That counter is incremented in tcp_add_backlog.
Roughly speaking, there are 3 queues associated with the receive side of an established TCP socket. If the application is blocked on a read when a packet arrives, it will generally be sent to the prequeue for processing in user space in the application process. If it can't be put on the prequeue, and the socket is not locked, it will be placed in the receive queue. However, if the socket is locked, it will be placed in the backlog queue for subsequent processing.
If you follow through the code you will see that the call to sk_add_backlog called from tcp_add_backlog will return -ENOBUFS if the receive queue is full (including that which is in the backlog queue) and the packet will be dropped and the counter incremented. I say this interpretation is incomplete because this is not the only place where a packet could be dropped when the "receive queue" is full (which we now understand to be not as straightforward as a single queue).
I wouldn't expect such drops to be frequent and/or problematic under normal operating conditions, as the sender's TCP stack should honor the advertised window of the receiver and not send data exceeding the capacity of the receive queue (with the exception of zero window probes and older kernel versions whose calculations could cause drops when the receive window was not actually full). If it is somehow indicative of a problem, I would start worrying about malicious clients (some form of DDoS maybe) or some failure causing a socket's lock to be held for an extended period of time.
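To watch this counter over time without going through netstat, a minimal sketch follows. It assumes a Linux system where the counter is exposed in /proc/net/netstat as pairs of header and value lines; the function names parse_netstat and backlog_drops are made up for this illustration.

```python
def parse_netstat(text):
    """Parse the header/value line pairs used by /proc/net/netstat."""
    counters = {}
    lines = text.strip().splitlines()
    # Lines come in pairs: "Prefix: name1 name2 ..." then "Prefix: val1 val2 ..."
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, names = header.split(":", 1)
        _, numbers = values.split(":", 1)
        for name, number in zip(names.split(), numbers.split()):
            counters[f"{prefix}.{name}"] = int(number)
    return counters


def backlog_drops(path="/proc/net/netstat"):
    """Return the current TCPBacklogDrop value, or None if the kernel does not report it."""
    with open(path) as f:
        return parse_netstat(f.read()).get("TcpExt.TCPBacklogDrop")
```

Sampling backlog_drops() twice and comparing the values shows whether drops are still happening or whether the number is left over from an earlier event.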

peer-to-peer epoll clients and deadlock

Suppose a peer-to-peer program uses epoll to perform asynchronous TCP reads from and writes to multiple peers. Naturally, this means that every file descriptor is set to nonblocking to allow epoll_wait to be called and for multiple sockets to be checked.
However, there is a potential issue. Suppose there are two peers: A and B. A tries to write a message to B, but B is congested or something and so the call to write returns -1 with errno set to EAGAIN. At this point, A goes to sleep on the call to epoll_wait.
But note that B is already stuck on its own call to epoll_wait. If B is never notified about A's failed attempt to send it a message, then B will never wake up and try to perform a read on A's socket, and the entire thing will deadlock. So my question is: is B guaranteed to be notified that A is attempting to send it a message, even if A gives up on the original write call and goes to sleep?
Even if the answer to the above is "yes", is it possible for a system like this to deadlock indefinitely because of application-layer desynchronization? i.e. A tries to write to B but fails, so it goes to sleep. Then B wakes up and tries to read from A, but fails because A went to sleep. etc.
Any protocol that has a possible state where both sides are permitted to wait for the other side to read before they read is a fundamentally broken protocol. For peer-to-peer protocols, typically each end is prohibited from delaying reads just because it cannot write.
On the implementation side, typically every call to epoll_wait (or the equivalent way you discover ready I/O) checks for input on all descriptors the program is using. Reading is never deferred unless the application has unprocessed data that it has already read and it stops deferring as soon as that data is processed. Waiting for network activity before reading is generally a very bad idea.
This is why typical protocol-neutral TCP proxies use two processes or two threads. You can't just read from A and then go do a blocking write to B because you don't know if B is required to read before it writes.
This is also why calling recv with MSG_WAITALL is almost always a bad idea. The other end might be waiting for you to receive the bytes it has already sent before it sends any more. No protocol can allow one side to wait for all the bytes to be sent before reading any of them if it also allows the other side to wait until some bytes have been read before sending the rest of them!
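The rule "never stop reading just because a write would block" can be sketched as a small event loop. This is only an illustration under stated assumptions: it uses Python's selectors module (which picks epoll on Linux) rather than raw epoll calls, both peers live in one process on a socket pair, and the names Peer and exchange are invented for the example.

```python
import selectors
import socket


class Peer:
    """One endpoint: always interested in reads, in writes only while output is pending."""

    def __init__(self, sock, outgoing):
        sock.setblocking(False)
        self.sock = sock
        self.pending = bytearray(outgoing)   # bytes still to be written
        self.received = bytearray()          # bytes read so far

    def interest(self):
        events = selectors.EVENT_READ        # reading is never switched off
        if self.pending:
            events |= selectors.EVENT_WRITE  # ask for writability only when needed
        return events


def exchange(payload_a, payload_b, timeout=5.0):
    """Send payload_a from A to B and payload_b from B to A over one socket pair."""
    sock_a, sock_b = socket.socketpair()
    a, b = Peer(sock_a, payload_a), Peer(sock_b, payload_b)
    expected = {a: len(payload_b), b: len(payload_a)}
    sel = selectors.DefaultSelector()
    for peer in (a, b):
        sel.register(peer.sock, peer.interest(), peer)
    try:
        while any(len(p.received) < n for p, n in expected.items()):
            ready = sel.select(timeout)
            if not ready:
                raise TimeoutError("no socket became ready")
            for key, mask in ready:
                peer = key.data
                if mask & selectors.EVENT_READ:
                    try:
                        peer.received += peer.sock.recv(65536)
                    except BlockingIOError:
                        pass
                if mask & selectors.EVENT_WRITE and peer.pending:
                    try:
                        sent = peer.sock.send(peer.pending)
                        del peer.pending[:sent]
                    except BlockingIOError:
                        pass  # EAGAIN: keep the data, retry on the next writable event
                sel.modify(peer.sock, peer.interest(), peer)
    finally:
        sel.close()
        sock_a.close()
        sock_b.close()
    return bytes(a.received), bytes(b.received)
```

With payloads larger than the socket buffers, two peers that each insisted on finishing their write before reading would stall; here a full send buffer only means the write is retried later, while reads keep draining the other direction.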

Schemes for streaming data with BLE GATT characteristics

The GATT architecture of BLE lends itself to small fixed pieces of data (20 bytes max per characteristic). But in some cases, you end up wanting to “stream” some arbitrary length of data that is greater than 20 bytes, for example a firmware upgrade, even if you know it's slow.
I’m curious what scheme others have used if any, to “stream” data (even if small and slow) over BLE characteristics.
I’ve used two different schemes to date:
One was to use a control characteristic, where the receiving device notifies the sending device how much data it has received, and the sending device then uses that to trigger the next write (I did this both with_response and without_response) on a different characteristic.
Another scheme I used recently was to chunk the data into 19-byte segments, where the first byte indicates the number of packets to follow. When it hits 0, that tells the receiver that all of the recent updates can be concatenated and processed as a single packet.
The kind of answer I'm looking for is an overview of how someone with experience has implemented a decent scheme for doing this, and who can justify why what they did is the best (or at least a better) solution.
After some review of existing protocols, I ended up designing a protocol for over-the-air update of my BLE peripherals.
Design assumptions
1. we cannot predict stack behavior (protocol will be used with all our products, whatever the chip used and the vendor stack, either on peripheral side or on central side, potentially unknown yet),
2. use standard GATT service,
3. avoid L2CAP fragmentation,
4. assume packets get queued before TX,
5. assume there may be some dropped packets (even if stacks should not),
6. avoid unnecessary packet round-trips,
7. put code complexity on central side,
8. assume 4.2 enhancements are unavailable.
1 implies 2-5, 6 is a performance requirement, 7 is optimization, 8 is portability.
Overall design
After discovery of service and reading a few read-only characteristics to check compatibility of device with image to be uploaded, all upload takes place between two characteristics:
payload (write only, without response),
status (notifiable).
The whole firmware image is sent in chunks through the payload characteristic.
Payload is a 20-byte characteristic: 4-byte chunk offset, plus 16-byte data chunk.
Status notifications tell whether there is an error condition or not, and next expected payload chunk offset. This way, uploader can tell whether it may go on speculatively, sending its chunks from its own offset, or if it should resume from offset found in status notification.
Status updates are sent for two main reasons:
when all goes well (payloads flying in, in order), at a given rate (like 4Hz, not on every packet),
on error (out of order, after some time without payload received, etc.), with the same given rate (not on every erroneous packet either).
Receiver expects all chunks in order, it does no reordering. If a chunk is out of order, it gets dropped, and an error status notification is pushed.
When a status comes in, it acknowledges all chunks with smaller offsets implicitly.
Lastly, there is a transmit window on the sender side: a run of successful acknowledgements allows the sender to enlarge its window (send more chunks ahead of the matching acknowledgement). The window is reduced if errors happen, since dropped chunks are probably caused by a queue overflow somewhere.
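The payload layout and the receiver's drop-on-gap rule can be sketched as follows. This is a rough illustration, not the actual firmware: the little-endian byte order of the offset, the 0xFF padding of a short final chunk, and the names make_payload and Receiver are all assumptions made for the example.

```python
import struct

CHUNK_SIZE = 16
# Assumed wire layout: 4-byte little-endian offset followed by a 16-byte chunk (20 bytes).
PAYLOAD = struct.Struct("<I16s")


def make_payload(offset, chunk):
    """Build one 20-byte payload write; a short final chunk is padded with 0xFF (assumption)."""
    return PAYLOAD.pack(offset, chunk.ljust(CHUNK_SIZE, b"\xff"))


class Receiver:
    """Accepts chunks strictly in order and remembers what to report in the next status."""

    def __init__(self):
        self.next_offset = 0
        self.image = bytearray()
        self.error = False

    def on_payload(self, payload):
        offset, chunk = PAYLOAD.unpack(payload)
        if offset != self.next_offset:
            self.error = True        # out of order: drop the chunk, no reordering
            return
        self.error = False
        self.image += chunk
        self.next_offset += CHUNK_SIZE

    def status(self):
        """Content of a (rate-limited) status notification: error flag and next expected offset."""
        return self.error, self.next_offset
```

For example, after receiving offsets 0 and then 32, status() reports an error with next expected offset 16, which tells the uploader where to resume.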
Using "one-way" PDUs (write without response and notification) satisfies 6. above, as the ATT protocol explicitly states that acknowledged PDUs (writes, indications) must not be pipelined (i.e. you may not send the next PDU until you have received the response).
Status, containing the next expected chunk offset, mitigates 5.
To abide by 2. and 3., payload is a 20-byte characteristic write. The 4+16 split has numerous advantages: one is that offset validation with a 16-byte chunk only involves shifts, another is that chunks are always page-aligned in target flash (better for 7.).
To cope with 4., more than one chunk is sent before receiving status update, speculating it will be correctly received.
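A sender-side sketch of this speculative sending with an adaptive window might look like the code below. The growth and shrink policy (one extra chunk per good status, halving on error) is an assumption chosen for illustration, since the text does not specify it, and the Sender class is hypothetical.

```python
CHUNK_SIZE = 16


class Sender:
    """Tracks which chunk offsets may be sent ahead of the last acknowledged offset."""

    def __init__(self, image_len, window=4):
        self.image_len = image_len
        self.window = window      # chunks allowed in flight without acknowledgement
        self.acked = 0            # every offset below this one is acknowledged
        self.cursor = 0           # next offset to send speculatively

    def offsets_to_send(self):
        """Offsets that may be written now without exceeding the window."""
        limit = min(self.image_len, self.acked + self.window * CHUNK_SIZE)
        offsets = list(range(self.cursor, limit, CHUNK_SIZE))
        if offsets:
            self.cursor = offsets[-1] + CHUNK_SIZE
        return offsets

    def on_status(self, error, next_expected):
        self.acked = next_expected                    # implicit ack of all smaller offsets
        if error:
            self.window = max(1, self.window // 2)    # assumed policy: halve on error
            self.cursor = next_expected               # resume from what the target expects
        else:
            self.window += 1                          # assumed policy: grow by one chunk
```

The central calls offsets_to_send() whenever it can queue more writes and on_status() for each status notification it receives.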
This protocol has the following features:
it adapts to radio conditions,
it adapts to queues on sender side,
there is no status flooding from target,
queues are kept filled, this allows the whole central stack to use every possible TX opportunity.
Some parameters are outside the scope of this protocol:
central should enforce short connection interval (try to enforce it in the updater app);
slave PHY should be well-behaved with slave latency (YMMV, test your vendor's stack);
you should probably compress your payload to reduce transfer time.
With:
15% compression,
a device connected with connectionInterval = 10ms,
a master PHY limiting every connection event to 4-5 TX packets,
average radio conditions,
I get 3.8 packets per connection event on average, i.e. ~6 kB/s of useful payload after packet loss, protocol overhead, etc.
This way, upload of a 60 kB image is done in less than 10 seconds, the whole process (connection, discovery, transfer, image verification, decompression, flashing, reboot) under 20 seconds.
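As a quick sanity check of these figures, the arithmetic can be written out. The sketch assumes 1 kB = 1000 bytes and that the 60 kB is the amount of data actually sent over the air; neither is stated explicitly above.

```python
CHUNK_BYTES = 16               # useful bytes in each 20-byte payload write
packets_per_event = 3.8        # measured average quoted above
connection_interval_s = 0.010  # connectionInterval = 10 ms

# Useful bytes per second: packets per event * bytes per packet / event period.
throughput = packets_per_event * CHUNK_BYTES / connection_interval_s

# Time to push 60 kB (assumed 60,000 bytes on the wire) at that rate.
transfer_time = 60_000 / throughput

print(f"{throughput:.0f} B/s, {transfer_time:.1f} s")
```

This gives roughly 6080 B/s and a little under 10 seconds, consistent with the numbers quoted.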
It depends a bit on what kind of central device you have.
Generally, Write Without Response is the way to stream data over BLE.
Packets should not be received out of order, since BLE's link layer never sends the next packet before the previous one has been acknowledged.
For Android it's very easy: just use Write Without Response to send all packets, one after another. Once you get the onCharacteristicWrite you send the next packet. That way Android automatically queues up the packets and it also has its own mechanism for flow control. When all its buffers are filled up, the onCharacteristicWrite will be called when there is space again.
iOS is not that smart, however. If you send a lot of Write Without Response packets and the internal buffers are full, iOS will silently drop new packets. There are two ways around this. One is to implement some (maybe complex) protocol in which the peripheral notifies the status of the transmission, as in Nipo's answer. An easier way is to send every 10th packet or so as a Write With Response and the rest as Write Without Response. That way iOS will queue up all packets for you and not drop the Write Without Response packets. The only downside is that each Write With Response packet requires one round-trip. This scheme should nevertheless give you high throughput.
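The "every 10th packet with response" idea comes down to choosing a write type per chunk. A minimal, platform-neutral sketch follows; plan_writes is a hypothetical helper, and the actual write calls would go through the platform's BLE API, which is not shown here.

```python
def plan_writes(data, chunk_size=20, ack_every=10):
    """Yield (chunk, with_response) pairs; every ack_every-th write asks for a response."""
    for index, start in enumerate(range(0, len(data), chunk_size), start=1):
        yield data[start:start + chunk_size], index % ack_every == 0
```

The caller sends each chunk with the indicated write type and waits for the response only on the flagged ones, which bounds how many unacknowledged writes can pile up in the stack's buffers.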

Preventing Dehydrated instances when using Parallel Convoy Correlation and messages are missing

I have an orchestration that gets activated by 1 of 2 types of messages coming in on a parallel shape. The messages are correlated by ID and status and then the remainder of the orchestration gets executed (and the messages get combined into 1).
I would like to devise a way to prevent dehydrated instances of the orchestrations from happening when one of the 2 messages does not come through. So basically, 1 message comes in and the other doesn't, the orchestration instance gets dehydrated while waiting for the second message.
I've been doing a bunch of searching and found a few decent ways to do it if this was serial convoy, but it's not and the order of the messages cannot be guaranteed.
For example, this post is very helpful in terms of serial convoys, but still does not satisfy my requirements.
I tried using a listen shape with each of the messages on its own branch and a delay on the third branch, but learned that if you are activating with a listen, all branches must activate and since the delay shape cannot activate an orchestration, it will not compile.
Any suggestions, or should I just give up and go for making a separate database in order to correlate the messages manually using pipelines?
Based on your description, the title of your message is slightly inaccurate. Dehydration is not the problem, the missing message is.
What you need to do is wrap the Receives in a Scope Shape with a Timeout set. Then, if the other Message does not arrive within the Timeout, a Timeout Exception will be raised, which you can handle and take the appropriate action.
Otherwise, the Parallel Shape will essentially wait forever for the other Message.

WSAECONNABORTED when using recv for the second time

I am writing a 2D multiplayer game consisting of two applications: a console server and a windowed client. So far, the server has an FD_SET which is filled with connected clients, a list of my game object pointers and some other things. In main(), I start listening on a socket and create three threads. The first accepts incoming connections and places them in the FD_SET. The second processes the objects' location, velocity and acceleration and flags them (if needed) as ones that have to be updated on the client. The third uses the send() function to send update info for every object (iterating through the list of object pointers). Such a packet consists of an operation code, the packet size and the actual data. On the client I parse it by first reading 5 bytes (the opcode and packet size), which are received correctly, but when I try to read the remaining part of the packet (since I now know its size), I get WSAECONNABORTED (error code 10053). I've read about this error, but can't see why it occurs in my application. Any help would be appreciated.
The error means the system closed the socket. This could be because it detected that the client disconnected, or because it was sending more data than you were reading.
A parser for network protocols typically needs a lot of work to make it robust, and you can't tell how much data you will get in a single read(). For example, you may get more than your operation code and packet size in the first chunk you read, or you might get less (e.g. only the operation code). Double-check that this isn't happening in your failure case.
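One common way to make such a parser robust is to loop until exactly the required number of bytes has arrived. The sketch below is written in Python rather than the asker's Winsock code, and the header layout (1-byte opcode plus 4-byte little-endian size that counts only the data after the header) is an assumption, since the question only says the header is 5 bytes.

```python
import socket
import struct

# Assumed header layout: 1-byte opcode + 4-byte little-endian payload size (5 bytes total).
HEADER = struct.Struct("<BI")


def recv_exact(sock, count):
    """Keep calling recv until exactly count bytes have been collected."""
    buf = bytearray()
    while len(buf) < count:
        part = sock.recv(count - len(buf))   # never ask for more than is still missing
        if not part:
            raise ConnectionError("peer closed the connection mid-packet")
        buf += part
    return bytes(buf)


def read_packet(sock):
    """Read one packet: fixed-size header first, then exactly the announced payload."""
    opcode, size = HEADER.unpack(recv_exact(sock, HEADER.size))
    # Assumption: size counts only the data that follows the header.
    return opcode, recv_exact(sock, size)
```

Because each recv asks only for the bytes still missing, the reader never consumes part of the next packet, and a short read simply triggers another loop iteration.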
