MPI Persistent Channels -- multiple outstanding requests on a single channel - mpi

I have a question about persistent p2p channels in MPI.
First, I'll explain my situation using Irecv / Wait and then will ask about persistent channels. I have a situation where I want to issue multiple Irecv commands with the same parameters before issuing the corresponding Wait.
Example:
MPI_Request reqs[3];
MPI_Irecv(params..., &req[0]);
MPI_Irecv(params..., &req[1]);
MPI_Irecv(params..., &req[2]);
...
MPI_Wait(req[0]);
MPI_Wait(req[1]);
MPI_Wait(req[2]);
where params are all the same.
Could I convert this to using the same persistent channel where I'm able to issue the MPI_Start three times on the same request that I got from MPI_Recv_init, and then call MPI_Wait three times on that same request? My guess is no, but it seems a bit odd to have three identical persistent channels.
Is there a recommended way to do this sort of thing?
Note that what I'm describing above doesn't seem to work as I desired as I'm getting this error:
MPI_Request req;
MPI_Recv_init(params..., &req);
MPI_Start(&req);
MPI_Start(&req);
This produces the error:
MPI_Start(108): MPI_Start(request=0x7fc1dede9284) failed
MPI_Start(84).: Persistent request passed to MPI_Start or MPI_Startall is already active.
Is there another structure in MPI that would serve this need, or should I just stick with an array of MPI_Reqeusts?

Related

Handling messages over TCP

I'm trying to send and receive messages over TCP using a size of each message appended before the it starts.
Say, First three bytes will be the length and later will the message:
As a small example:
005Hello003Hey002Hi
I'll be using this method to do large messages, but because the buffer size will be a constant integer say, 200 Bytes. So, there is a chance that a complete message may not be received e.g. instead of 005Hello I get 005He nor a complete length may be received e.g. I get 2 bytes of length in message.
So, to get over this problem, I'll need to wait for next message and append it to the incomplete message etc.
My question is: Am I the only one having these difficulties to appending messages to each other, appending lengths etc.. to make them complete Or is this really usually how we need to handle the individual messages on TCP? Or, if there is a better way?
What you're seeing is 100% normal TCP behavior. It is completely expected that you'll loop receiving bytes until you get a "message" (whatever that means in your context). It's part of the work of going from a low-level TCP byte stream to a higher-level concept like "message".
And "usr" is right above. There are higher level abstractions that you may have available. If they're appropriate, use them to avoid reinventing the wheel.
So, there is a chance that a complete message may not be received e.g.
instead of 005Hello I get 005He nor a complete length may be received
e.g. I get 2 bytes of length in message.
Yes. TCP gives you at least one byte per read, that's all.
Or is this really usually how we need to handle the individual messages on TCP? Or, if there is a better way?
Try using higher-level primitives. For example, BinaryReader allows you to read exactly N bytes (it will internally loop). StreamReader lets you forget this peculiarity of TCP as well.
Even better is using even more higher-level abstractions such as HTTP (request/response pattern - very common), protobuf as a serialization format or web services which automate pretty much all transport layer concerns.
Don't do TCP if you can avoid it.
So, to get over this problem, I'll need to wait for next message and append it to the incomplete message etc.
Yep, this is how things are done at the socket level code. For each socket you would like to allocate a buffer of at least the same size as kernel socket receive buffer, so that you can read the entire kernel buffer in one read/recv/resvmsg call. Reading from the socket in a loop may starve other sockets in your application (this is why they changed epoll to be level-triggered by default, because the default edge-triggered forced application writers to read in a loop).
The first incomplete message is always kept in the beginning of the buffer, reading the socket continues at the next free byte in the buffer, so that it automatically appends to the incomplete message.
Once reading is done, normally a higher level callback is called with the pointers to all read data in the buffer. That callback should consume all complete messages in the buffer and return how many bytes it has consumed (may be 0 if there is only an incomplete message). The buffer management code should memmove the remaining unconsumed bytes (if any) to the beginning of the buffer. Alternatively, a ring-buffer can be used to avoid moving those unconsumed bytes, but in this case the higher level code should be able to cope with ring-buffer iterators, which it may be not ready to. Hence keeping the buffer linear may be the most convenient option.

MPI non-blocking send and recv and mpi_iprobe() for unknown message size

In a spatially decomposed 2D domain, I need to send particles to the 8 neighbors. I know how many I'm sending but not how many I'll receive from these neighbors.
I had implemented a code with MPI_Send(), MPI_Probe() and MPI_Recv() but I realized that it caused deadlocks whenever the message was too big.
I decided to go for non-blocking communications but then I can't figure out in what order MPI_Isend, MPI_Irecv and MPI_Iprobe should be called? I definitely need to know the size my receiving buffer should be allocated to before actually calling MPI_Irecv so I'm tempted by the order MPI_Isend() then MPI_Iprobe() then MPI_Irecv(), but the problem is that MPI_Iprove() always returns a flag equal to false and I get stuck in the while loop. As far as I understand there no obligation for MPI to actually complete the send before the call to MPI_Wait(), therefore I understand that MPI_Iprobe might never return true. But if so, how does one receives an unknown size message in non-blocking MPI point-to-point communications?
You don't have to make all 3 operations non-blocking. You can use an MPI_ISEND with a regular MPI_PROBE and/or MPI_RECV. It sounds like that might be a better option for you.

A MailboxProcessor that operates with a LIFO logic

I am learning about F# agents (MailboxProcessor).
I am dealing with a rather unconventional problem.
I have one agent (dataSource) which is a source of streaming data. The data has to be processed by an array of agents (dataProcessor). We can consider dataProcessor as some sort of tracking device.
Data may flow in faster than the speed with which the dataProcessor may be able to process its input.
It is OK to have some delay. However, I have to ensure that the agent stays on top of its work and does not get piled under obsolete observations
I am exploring ways to deal with this problem.
The first idea is to implement a stack (LIFO) in dataSource. dataSource would send over the latest observation available when dataProcessor becomes available to receive and process the data. This solution may work but it may get complicated as dataProcessor may need to be blocked and re-activated; and communicate its status to dataSource, leading to a two way communication problem. This problem may boil down to a blocking queue in the consumer-producer problem but I am not sure..
The second idea is to have dataProcessor taking care of message sorting. In this architecture, dataSource will simply post updates in dataProcessor's queue. dataProcessor will use Scanto fetch the latest data available in his queue. This may be the way to go. However, I am not sure if in the current design of MailboxProcessorit is possible to clear a queue of messages, deleting the older obsolete ones. Furthermore, here, it is written that:
Unfortunately, the TryScan function in the current version of F# is
broken in two ways. Firstly, the whole point is to specify a timeout
but the implementation does not actually honor it. Specifically,
irrelevant messages reset the timer. Secondly, as with the other Scan
function, the message queue is examined under a lock that prevents any
other threads from posting for the duration of the scan, which can be
an arbitrarily long time. Consequently, the TryScan function itself
tends to lock-up concurrent systems and can even introduce deadlocks
because the caller's code is evaluated inside the lock (e.g. posting
from the function argument to Scan or TryScan can deadlock the agent
when the code under the lock blocks waiting to acquire the lock it is
already under).
Having the latest observation bounced back may be a problem.
The author of this post, #Jon Harrop, suggests that
I managed to architect around it and the resulting architecture was actually better. In essence, I eagerly Receive all messages and filter using my own local queue.
This idea is surely worth exploring but, before starting to play around with code, I would welcome some inputs on how I could structure my solution.
Thank you.
Sounds like you might need a destructive scan version of the mailbox processor, I implemented this with TPL Dataflow in a blog series that you might be interested in.
My blog is currently down for maintenance but I can point you to the posts in markdown format.
Part1
Part2
Part3
You can also check out the code on github
I also wrote about the issues with scan in my lurking horror post
Hope that helps...
tl;dr I would try this: take Mailbox implementation from FSharp.Actor or Zach Bray's blog post, replace ConcurrentQueue by ConcurrentStack (plus add some bounded capacity logic) and use this changed agent as a dispatcher to pass messages from dataSource to an army of dataProcessors implemented as ordinary MBPs or Actors.
tl;dr2 If workers are a scarce and slow resource and we need to process a message that is the latest at the moment when a worker is ready, then it all boils down to an agent with a stack instead of a queue (with some bounded capacity logic) plus a BlockingQueue of workers. Dispatcher dequeues a ready worker, then pops a message from the stack and sends this message to the worker. After the job is done the worker enqueues itself to the queue when becomes ready (e.g. before let! msg = inbox.Receive()). Dispatcher consumer thread then blocks until any worker is ready, while producer thread keeps the bounded stack updated. (bounded stack could be done with an array + offset + size inside a lock, below is too complex one)
Details
MailBoxProcessor is designed to have only one consumer. This is even commented in the source code of MBP here (search for the word 'DRAGONS' :) )
If you post your data to MBP then only one thread could take it from internal queue or stack.
In you particular use case I would use ConcurrentStack directly or better wrapped into BlockingCollection:
It will allow many concurrent consumers
It is very fast and thread safe
BlockingCollection has BoundedCapacity property that allows you to limit the size of a collection. It throws on Add, but you could catch it or use TryAdd. If A is a main stack and B is a standby, then TryAdd to A, on false Add to B and swap the two with Interlocked.Exchange, then process needed messages in A, clear it, make a new standby - or use three stacks if processing A could be longer than B could become full again; in this way you do not block and do not lose any messages, but could discard unneeded ones is a controlled way.
BlockingCollection has methods like AddToAny/TakeFromAny, which work on an arrays of BlockingCollections. This could help, e.g.:
dataSource produces messages to a BlockingCollection with ConcurrentStack implementation (BCCS)
another thread consumes messages from BCCS and sends them to an array of processing BCCSs. You said that there is a lot of data. You may sacrifice one thread to be blocking and dispatching your messages indefinitely
each processing agent has its own BCCS or implemented as an Agent/Actor/MBP to which the dispatcher posts messages. In your case you need to send a message to only one processorAgent, so you may store processing agents in a circular buffer to always dispatch a message to least recently used processor.
Something like this:
(data stream produces 'T)
|
[dispatcher's BCSC]
|
(a dispatcher thread consumes 'T and pushes to processors, manages capacity of BCCS and LRU queue)
| |
[processor1's BCCS/Actor/MBP] ... [processorN's BCCS/Actor/MBP]
| |
(process) (process)
Instead of ConcurrentStack, you may want to read about heap data structure. If you need your latest messages by some property of messages, e.g. timestamp, rather than by the order in which they arrive to the stack (e.g. if there could be delays in transit and arrival order <> creation order), you can get the latest message by using heap.
If you still need Agents semantics/API, you could read several sources in addition to Dave's links, and somehow adopt implementation to multiple concurrent consumers:
An interesting article by Zach Bray on efficient Actors implementation. There you do need to replace (under the comment // Might want to schedule this call on another thread.) the line execute true by a line async { execute true } |> Async.Start or similar, because otherwise producing thread will be consuming thread - not good for a single fast producer. However, for a dispatcher like described above this is exactly what needed.
FSharp.Actor (aka Fakka) development branch and FSharp MPB source code (first link above) here could be very useful for implementation details. FSharp.Actors library has been in a freeze for several months but there is some activity in dev branch.
Should not miss discussion about Fakka in Google Groups in this context.
I have a somewhat similar use case and for the last two days I have researched everything I could find on the F# Agents/Actors. This answer is a kind of TODO for myself to try these ideas, of which half were born during writing it.
The simplest solution is to greedily eat all messages in the inbox when one arrives and discard all but the most recent. Easily done using TryReceive:
let rec readLatestLoop oldMsg =
async { let! newMsg = inbox.TryReceive 0
match newMsg with
| None -> oldMsg
| Some newMsg -> return! readLatestLoop newMsg }
let readLatest() =
async { let! msg = inbox.Receive()
return! readLatestLoop msg }
When faced with the same problem I architected a more sophisticated and efficient solution I called cancellable streaming and described in in an F# Journal article here. The idea is to start processing messages and then cancel that processing if they are superceded. This significantly improves concurrency if significant processing is being done.

RPC semantics what exactly is the purpose

I was going through the rpc semantics, at-least-once and at-most-once semantics, how does they work?
Couldn't understand the concept of their implementation.
In both cases, the goal is to invoke the function once. However, the difference is in their failure modes. In "at-least-once", the system will retry on failure until it knows that the function was successfully invoked, while "at-most-once" will not attempt a retry (or will ensure that there is a negative acknowledgement of the invocation before retrying).
As to how these are implemented, this can vary, but the pseudo-code might look like this:
At least once:
request_received = false
while not request_received:
send RPC
wait for acknowledgement with timeout
if acknowledgment received and acknowledgement.is_successful:
request_received = true
At most once:
request_sent = false
while not request_sent:
send RPC
request_sent = true
wait for acknowledgement with timeout
if acknowledgment received and not acknowledgement.is_successful:
request_sent = false
An example case where you want to do "at-most-once" would be something like payments (you wouldn't want to accidentally bill someone's credit card twice), where an example case of "at-least-once" would be something like updating a database with a particular value (if you happen to write the same value to the database twice in a row, that really isn't going to have any effect on anything). You almost always want to use "at-least-once" for non-mutating (a.k.a. idempotent) operations; by contrast, most mutating operations (or at least ones that incrementally mutate the state and are thus dependent on the current/prior state when applying the mutation) would need "at-most-once".
I should add that it is fairly common to implement "at most once" semantics on top of an "at least once" system by including an identifier in the body of the RPC that uniquely identifies it and by ensuring on the server that each ID seen by the system is processed only once. You can think of the sequence numbers in TCP packets (ensuring the packets are delivered once and in order) as a special case of this pattern. This approach, however, can be somewhat challenging to implement correctly on distributed systems where retries of the same RPC could arrive at two separate computers running the same server software. (One technique for dealing with this is to record the transaction where the RPC is received, but then to aggregate and deduplicate these records using a centralized system before redistributing the requests inside the system for further processing; another technique is to opportunistically process the RPC, but to reconcile/restore/rollback state when synchronization between the servers eventually detects this duplication... this approach would probably not fly for payments, but it can be useful in other situations like forum posts).

What are all the differences between pipes and message queues?

What are all the differences between pipes and message queues?
Please explain both from vxworks & unix perspectives.
I think pipes are unidirectional but message queues aren't.
But don't pipes internally use message queues, then how come pipes are unidirectional but message queues are not?
What are the other differences you can think of (from design or usage or other perspectives)?
Message Queues are:
UNIDIRECTIONAL
Fixed number of entries
Each entry has a maximum size
All the queue memory (# entries * entry size) allocated at creation
Datagram-like behavior: reading an entry removes it from the queue. If you don't read the entire data, the rest is lost. For example: send a 20 byte message, but the receiver reads 10 bytes. The remaining 10 bytes are lost.
Task can only pend on a single queue using msqQReceive (there are ways to change that with alternative API)
When sending, you will pend if the queue is full (and you don't do NO_WAIT)
When receiving, you will pend if the queue is empty (and you don't do NO_WAIT)
Timeouts are supported on receive and send
Pipes
Are a layer over message Queues <--- Unidirectional!
Have a maximum number of elements and each element has maximum size
is NOT A STREAMING INTERFACE. Datagram semantics, just list message Queues
On read, WILL PEND until there is data to read
On write, WILL PEND until there is space in the underlying message queue
Can use select facility to wait on multiple pipes
That's what I can think of right now.
"VxWorks pipes differ significantly from UNIX pipes", says the vxWorks documentation, and they ain't kidding. Here's the manpages.
It looks like it would not be exaggerating much to say that the only similarity between Unix pipes and vxWorks pipes are that they're a form of IPC. The features are different, the APIs are different, and the implementations are surely very different.
I also found this difference in IPC in UNIX. It states that the difference between them is that Message queues and pipes is that the first stores/retrieves info in packets. While pipes do it character by character.
Msg queue:
Message queue: An anonymous data stream similar to the pipe, but stores
and retrieves information in packets.
Pipe
Pipe: A two-way data stream interfaced through standard input and
output and is read character by character
I also found this question here: Pipe vs msg queue
Comparison of message queues and pipes:
- ONE message queue may be used to pass data in both directions
- the message needn't to be read on first in-first out basis
but can be processed selectively instead
source: see http://www.cs.vsb.cz/grygarek/dosys/IPC.txt
MQs have kernel persistence, and can be opened by multiple processes.

Resources