MPI one-sided communication with user callbacks - asynchronous

To overlap MPI communication with computation, I am trying to issue asynchronous I/O (MPI calls) together with a user-defined function that processes the data once the I/O completes.
Windows overlapped I/O does not help with MPI: it supports overlapped operation only for file I/O and socket communication, not for MPI operations.
I cannot find an appropriate MPI API for this. Does anyone have any insight?

There are no completion callbacks in MPI. Non-blocking operations always return a request handle that must either be synchronously waited on using MPI_Wait and family or periodically tested using the non-blocking MPI_Test and family.
With the help of either MPI_Waitsome or MPI_Testsome, it is possible to implement a dispatch mechanism that monitors multiple requests and calls specific functions upon their completion. None of the MPI calls has any timeout characteristics though - it is either "wait forever" (MPI_Wait...) or "check without waiting" (MPI_Test...).
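The poll-and-dispatch pattern described above can be sketched as follows. This is not MPI code: it uses Python's concurrent.futures as a stand-in, with futures playing the role of request handles and wait(..., timeout=0) playing the role of a "check without waiting" call such as MPI_Testsome. The names fake_transfer, dispatch and run_demo are hypothetical and exist only for this illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED


def fake_transfer(delay, payload):
    """Hypothetical stand-in for a non-blocking transfer; returns the payload."""
    time.sleep(delay)
    return payload


def dispatch(pending):
    """Poll `pending` (a dict of future -> callback) and run each callback
    once its future has completed."""
    results = []
    while pending:
        # timeout=0 means "check without waiting", similar in spirit to MPI_Testsome.
        done, _ = wait(list(pending), timeout=0, return_when=FIRST_COMPLETED)
        for fut in done:
            callback = pending.pop(fut)
            results.append(callback(fut.result()))
        if not done:
            # Stand-in for useful computation between two polls.
            time.sleep(0.001)
    return results


def run_demo():
    with ThreadPoolExecutor(max_workers=2) as pool:
        pending = {
            pool.submit(fake_transfer, 0.05, [1, 2, 3]): sum,
            pool.submit(fake_transfer, 0.01, "abc"): str.upper,
        }
        return dispatch(pending)


if __name__ == "__main__":
    print(run_demo())
```

In a real MPI program the dictionary would map request handles to user functions, and the loop body would do actual computation between tests instead of sleeping.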


Can the operation time.sleep(seconds) be considered asynchronous I/O?

With Python's asyncio library, and with asynchronous programming in general, I always think of performing “concurrent” I/O operations at the thread level to make better use of the CPU.
The asyncio library has the function asyncio.sleep(seconds), but what bothers me is that sleeping is not an I/O operation: it is handled at the kernel level with CPU hardware alone, without any external device that could count as I/O (my definition of I/O is any hardware other than the CPU and RAM).
So why does the asyncio library (Asynchronous I/O) treat this operation as an asynchronous I/O operation?
This is not a network interface controller or a hard disk that we send requests to. I have no problem with running every operation we can “concurrently” at the thread level. However, the “I/O” at the end of the library's name makes me feel that it is not the proper terminology. I would be happy for clarification.
One more related question: does the term asynchronous programming refer only to “concurrent” I/O operations, or to every operation, including CPU operations like x = x + 1 at the thread level? (I guess the latter could be done “concurrently” at the thread level, but it would be unnecessary.)
Code snippet:
import asyncio

async def main():
    print('Hello ...')
    await asyncio.sleep(1)  # suspends main() without blocking the event loop
    print('... World!')

asyncio.run(main())
Paraphrasing Wikipedia, "Asynchronous programming" generally refers to the occurrence of events outside of the main program flow and ways of handling such events. As such, asynchronous operations are not necessarily I/O ones.
These asynchronous events are generally handled at the hardware or OS level and it is important to understand that at this level almost anything is asynchronous: jobs are put into queues and scheduled by the OS, then they are regularly polled for completion by the OS which then notifies the main application that the job is done.
Such asynchronous events include:
Network requests (multiplexed and polled by the OS),
Timers (managed by hardware timers and interrupts),
Communication with various external devices such as keyboards (hardware interrupts),
Communication with internal devices such as the GPU (jobs are committed to command queues).
The purpose of the AsyncIO library is to allow the expression of asynchronous programs in a more "structured" and linear way. As such, it wraps many common asynchronous operations such as I/Os and timers into async-await equivalents. AsyncIO is thus not restricted to only asynchronous I/O operations and one can implement an AsyncIO async-await interface to support GPU for example.
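To make the timer case concrete, here is a minimal sketch of a sleep built purely from an event-loop timer, with no I/O device involved. The names timer_sleep and worker are hypothetical; loop.create_future and loop.call_later are standard asyncio calls. The sketch ignores cancellation, which a production version would need to handle.

```python
import asyncio
import time


async def timer_sleep(delay):
    """A sleep built only from an event-loop timer; no I/O is performed."""
    loop = asyncio.get_running_loop()
    fut = loop.create_future()
    # Ask the loop to complete the future after `delay` seconds.
    loop.call_later(delay, fut.set_result, None)
    await fut  # suspends this coroutine; the loop is free to run others


async def worker(name, delay, log):
    await timer_sleep(delay)
    log.append(name)


async def main():
    log = []
    start = time.monotonic()
    # Both timers run concurrently on one thread.
    await asyncio.gather(worker("slow", 0.2, log), worker("fast", 0.1, log))
    return log, time.monotonic() - start


log, elapsed = asyncio.run(main())
```

The two waits overlap on a single thread, so the total time is governed by the longer timer instead of the sum of both, which is the same benefit asyncio gives for network I/O.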

Can tasks be executed asynchronously on a serial queue?

I am trying to understand the basic functionality of Serial Queue and Concurrent Queue in GCD.
Can we perform synchronous operations on a concurrent queue? As I understand it, synchronous means executing tasks one after another, so how is that possible with a concurrent queue, which executes tasks in parallel? It seems contradictory to me.
Similarly, how can we perform an asynchronous operation on a serial queue? A serial queue performs tasks one after another, so how can they be executed concurrently?
If anyone can explain this with the help of an image, it will be much clearer.
You asked:
Can we perform synchronous operations on a concurrent queue? As I understand it, synchronous means executing tasks one after another, so how is that possible with a concurrent queue, which executes tasks in parallel?
OK, let’s consider terminology before answering your question:
What is a “synchronous operation”? It is one that will block its respective thread during that operation. But a concurrent queue can use multiple threads to perform these individual synchronous operations on that same queue at the same time, each running on its own thread.
Let us use a practical example: Consider a synchronous operation that might be an algorithm to process an image (e.g. resize it or convert a color image to black-and-white). When you perform this operation, it will generally tie up the respective thread until the operation is done.
So, given that example, yes, you certainly can (and we often do) perform multiple synchronous operations in parallel. Using our prior example, you might have four images that you want to process concurrently. So you might instantiate a concurrent queue and add these four operations to that queue, and they will be processed in parallel, each on its own “worker thread”.
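The same idea can be sketched outside GCD. The following is a Python analogy, not GCD code: a ThreadPoolExecutor with four workers plays the role of the concurrent queue, and process_image is a hypothetical blocking task standing in for the image-processing algorithm.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def process_image(name):
    """Hypothetical synchronous task: ties up its thread until it is done."""
    time.sleep(0.05)  # stand-in for the real image-processing work
    return name, threading.current_thread().name


def process_all(names):
    # The pool plays the role of a concurrent queue with four worker threads.
    with ThreadPoolExecutor(max_workers=4) as pool:
        # map() returns results in submission order, even though
        # the tasks themselves may run at the same time.
        return list(pool.map(process_image, names))


results = process_all(["a.png", "b.png", "c.png", "d.png"])
```

Each call to process_image blocks its own worker thread, yet the four calls can overlap because the pool may run them on different threads.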
You then ask:
Similarly, how can we perform an asynchronous operation on a serial queue? A serial queue performs tasks one after another, so how can they be executed concurrently?
This depends a little upon what you mean by “operation”. Are you talking about a Swift Operation (or Objective-C NSOperation) on an “operation queue”? Or are you using the term “operation” a little more generally as it applies to GCD and dispatch queues?
The reason I ask is that in the world of GCD (aka “dispatch queues”), you simply do not “perform an asynchronous operation on a serial queue”. You start asynchronous tasks from a serial queue, but the definition of “asynchronous” means that the current thread does not wait for the task to finish (which generally means that, often behind the scenes, another queue/thread is doing the work).
A good example of that would be when you start a series of network requests from a serial queue. Hidden in NSURLSession/URLSession, it has its own queues/threads that are managing these multiple network requests concurrently. If you do not want these requests to run concurrently, some sleight of hand is required to take an API which is designed for concurrent operation and have it behave sequentially, one after the other.
This is where operation queues come into play, as they do have the concept of custom Operation/NSOperation subclasses, in which you can define an operation to wrap an asynchronous task, such that the operation does not “complete” until the asynchronous task is done. It uses KVO to notify the queue when the operation is executing, is finished, etc. In that scenario, you can define a serial operation queue (i.e., one with a maxConcurrentOperationCount of 1), add a series of your own asynchronous operation subclass instances to that queue, and it can run them sequentially, one after the other. But using operation queues with asynchronous operations can be a little complicated. If that’s really what you are trying to do, we can point you to some examples. But, in the interest of full disclosure, this operation queue pattern is used less frequently nowadays, and you will often see other patterns such as Combine, or the new async-await API, to achieve similar results.
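The underlying idea, that the next unit of work must not start until the previous asynchronous task has actually finished, can be sketched as follows. This is a Python asyncio analogy and not an Operation subclass: a single consumer on an asyncio.Queue plays the role of a serial operation queue, and fetch is a hypothetical asynchronous task.

```python
import asyncio


async def fetch(name, delay, log):
    """Hypothetical asynchronous task, e.g. a network request."""
    log.append(f"start {name}")
    await asyncio.sleep(delay)
    log.append(f"finish {name}")


async def serial_worker(queue):
    """Run queued jobs strictly one after the other."""
    while True:
        job = await queue.get()
        await job  # the next job is not started until this one completes
        queue.task_done()


async def main():
    log = []
    queue = asyncio.Queue()
    for name, delay in [("A", 0.03), ("B", 0.01)]:
        queue.put_nowait(fetch(name, delay, log))
    worker = asyncio.create_task(serial_worker(queue))
    await queue.join()  # wait until every queued job has finished
    worker.cancel()
    return log


log = asyncio.run(main())
```

Even though B is shorter than A, it does not begin until A has finished, which is the sequential behavior the wrapped-operation pattern is meant to provide.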
So, we can’t answer this latter question without a little more detail of what precisely you mean by “asynchronous operation on serial queue”. Give us a practical example of what you mean (and what API you are using).

Advantage of MPI_SEND over MPI_ISEND?

Using MPI_SEND (the standard blocking send) is simpler than using MPI_ISEND (the standard non-blocking send), because the latter should be used along with another MPI function to ensure that the communication has been "completed", so that the send buffer can be reused. But apart from that, has MPI_SEND any advantages over MPI_ISEND? It seems that, in general, MPI_ISEND prevents deadlock and also allows better performance (because the calling process can do other things while the communication proceeds in the background by MPI implementation).
So, is it a good idea to use the blocking version at all?
Performance-wise, MPI_Send() has the potential of being faster than MPI_Isend() immediately followed by MPI_Wait() (and it is faster in Open MPI).
But most importantly, if your MPI library does not provide a progress thread, your message might be sitting on the sender node until MPI is progressed by your code (that typically occurs when an MPI subroutine is invoked, and definitely happens when MPI_Wait() is called).

Is all asynchronous I/O ultimately implemented with polling?

I had thought that asynchronous I/O always takes a callback form. But recently I discovered that some low-level implementations use a polling-style API.
This leads me to think that maybe all (or most) asynchronous I/O (file, socket, mach-port, etc.) is ultimately implemented in some kind of polling manner. Maybe the callback form is just an abstraction for higher-level APIs.
This could be a silly question, but I don't know how most asynchronous I/O is actually implemented at a low level. I have only used system-level notifications, and when I look at kqueue, which is the system notification mechanism, it is polling style!
How should I understand asynchronous I/O at a low level? How is the high-level asynchronous notification produced from a low-level polling system (if that is actually what happens)?
At the lowest (or at least, lowest worth looking at) hardware level, asynchronous operations truly are asynchronous in modern operating systems.
For example, when you read a file from the disk, the operating system translates your call to read to a series of disk operations (seek to location, read blocks X through Y, etc.). On most modern OSes, these commands get written either to special registers, or special locations in main memory, and the disk controller is informed that there are operations pending. The operating system then goes on about its business, and when the disk controller has completed all of the operations assigned to it, it triggers an interrupt, causing the thread that requested the read to pick up where it left off.
Regardless of what type of low-level asynchronous operation you're looking at (disk I/O, network I/O, mouse and keyboard input, etc.), ultimately, there is some stage at which a command is dispatched to hardware, and the "callback" as it were is not executed until the hardware reaches out and informs the OS that it's done, usually in the form of an interrupt.
That's not to say that there aren't some asynchronous operations implemented using polling. One trivial (but naive and costly) way to implement any blocking operation asynchronously is just to spawn a thread that waits for the operation to complete (perhaps polling in a tight loop), and then call the callback when it's finished. Generally speaking, though, common asynchronous operations at the OS level are truly asynchronous.
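The naive thread-based approach mentioned above can be sketched in a few lines. Here blocking_read and read_async are hypothetical names, and the sleep stands in for a real blocking call.

```python
import threading
import time


def blocking_read():
    """Hypothetical blocking operation, e.g. a slow read."""
    time.sleep(0.05)
    return b"data"


def read_async(callback):
    """Run blocking_read on a helper thread and hand the result to callback."""
    def run():
        callback(blocking_read())

    thread = threading.Thread(target=run)
    thread.start()
    return thread


received = []
thread = read_async(received.append)
# The calling thread is free to do other work here while the read is pending.
thread.join()
```

Note that the callback runs on the helper thread, so real code would need to think about thread safety, and spending one thread per pending operation is what makes this approach costly.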
It's also worth mentioning that just because an API is blocking doesn't mean it's polling: you can put a blocking API on an asynchronous operation, and a non-blocking API on a synchronous operation. With things like select and kqueues, for example, the thread actually just goes to sleep until something interesting happens. That "something interesting" comes in the form of an interrupt (usually), and that's taken as an indication that the operating system should wake up the relevant threads to continue work. It doesn't just sit there in a tight loop waiting for something to happen.
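A small sketch of that behavior, using Python's selectors module (which wraps select, kqueue, epoll and similar mechanisms depending on the platform). The function name wait_for_data is hypothetical, and a local socket pair stands in for a real network connection.

```python
import selectors
import socket
import threading


def wait_for_data(timeout=2.0):
    """Block in select() until the socket becomes readable, then read it."""
    reader, writer = socket.socketpair()
    sel = selectors.DefaultSelector()
    sel.register(reader, selectors.EVENT_READ)

    # Another thread sends data a little later, simulating an external event.
    timer = threading.Timer(0.05, writer.send, args=(b"ping",))
    timer.start()

    # The calling thread sleeps inside this call; it does not spin in a loop.
    events = sel.select(timeout)
    data = b"".join(key.fileobj.recv(16) for key, _ in events)

    timer.join()
    sel.close()
    reader.close()
    writer.close()
    return data


if __name__ == "__main__":
    print(wait_for_data())
```

The API looks like polling because the program asks "which sockets are ready?", but the thread is suspended by the operating system until there is something to report or the timeout expires.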
There really is no way to tell whether a system uses polling or "real" callbacks (like interrupts) just from its API, but yes, there are asynchronous APIs that are truly backed by asynchronous operations.

What is the standard way to use MPI_Isend to send the same message to many processors?

I was originally using MPI_Send paired with MPI_Irecv, but I recently found out that MPI_Send may block until the message is received. So, I'm changing to MPI_Isend and I need to send the same message to N different processors. Assuming the buffer will get destroyed later, should I have a for loop with MPI_Isend and MPI_Wait in the loop, or should I make an array of requests and have only MPI_Isend in the loop with MPI_Waitall after the loop?
For distributing the same buffer to "n" remote ranks, MPI_Bcast is the "obvious" choice. Unless you have some "overwhelming" reason to avoid MPI_Bcast, it would be advisable to use it. In general, MPI_Bcast is very well optimized by all the major MPI implementations.
If blocking is an issue, the MPI 3.0 standard introduced MPI_Ibcast along with other non-blocking collectives. The initial implementation of non-blocking collectives appears to be "naive" and built as wrappers to non-blocking point-to-point routines (e.g. MPI_Ibcast is implemented as a wrapper around calls to MPI_Isend and MPI_Irecv). The implementations are likely to improve in quality over the next year or two, depending partly on the speed of adoption by the MPI application developer community.
MPI_Send will "block" until the send buffer can be safely re-used by the calling application. Nothing is guaranteed about the state of the corresponding MPI_[I]Recv's.
If you need non-blocking, then the best advice would be to call MPI_Isend in a loop. Alternatively, persistent communication requests could be used with MPI_Start or MPI_Startall if this is a message pattern that will be repeated over the course of the program.
Since it's the same message, you should be able to use MPI_Bcast. You'll just have to create a new communicator to define a subgroup of processes.
