Why does blocking I/O happen? - asynchronous

I often hear about asynchronous I/O, which does not block when there is nothing to read or write.
My question is: when we perform a blocking operation, I don't see any logic in my code that blocks execution. So what causes the blocking? The operating system?
And if we want a non-blocking operation, do we have to wait for the OS to provide support? Or can we implement a non-blocking version on top of the blocking version?

All code in a single process (or thread) is blocking by default.
Each line runs only after the previous one has finished.
You don't have to declare anything to get blocking behavior.
If, on the other hand, you want to make a blocking call non-blocking yourself, you have to run it in another process (or thread).
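As a minimal sketch of that idea, the following Python snippet pushes a blocking socket read onto a worker thread so that the main thread can carry on. The names `blocking_read`, `reader`, and `writer` are made up for illustration, and the socket pair merely stands in for whatever blocking call you actually have. Note that the blocking itself happens inside the OS: the worker thread is suspended in `recv` until data arrives.

```python
import socket
from concurrent.futures import ThreadPoolExecutor

# A connected pair of sockets stands in for any blocking I/O source (illustrative only).
reader, writer = socket.socketpair()

def blocking_read():
    # recv() suspends the calling thread inside the OS until data is available.
    return reader.recv(16)

with ThreadPoolExecutor(max_workers=1) as pool:
    # submit() returns immediately; only the worker thread blocks in recv().
    future = pool.submit(blocking_read)

    # The main thread is free to do other work here. Nothing has been sent yet,
    # so the read cannot have completed.
    still_pending = not future.done()

    writer.sendall(b"hello")   # now the worker's recv() can return
    result = future.result()   # the main thread blocks only at this point

reader.close()
writer.close()
```

The blocking call is still blocking; it has only been moved to a thread whose stalling does not matter to the caller.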

Related

Advantage of MPI_SEND over MPI_ISEND?

Using MPI_SEND (the standard blocking send) is simpler than using MPI_ISEND (the standard non-blocking send), because the latter must be paired with another MPI function to ensure that the communication has "completed", so that the send buffer can be reused. But apart from that, does MPI_SEND have any advantages over MPI_ISEND? It seems that, in general, MPI_ISEND prevents deadlock and also allows better performance (because the calling process can do other things while the MPI implementation carries out the communication in the background).
So, is it a good idea to use the blocking version at all?
Performance-wise, MPI_Send() has the potential to be faster than MPI_Isend() immediately followed by MPI_Wait() (and it is faster in Open MPI).
But most importantly, if your MPI library does not provide a progress thread, your message might sit on the sender node until MPI is progressed by your code (that typically occurs when an MPI subroutine is invoked, and definitely happens when MPI_Wait() is called).

Abstract implementation of non-blocking MPI calls

Non-blocking sends/recvs return immediately in MPI and the operation is completed in the background. The only way I can see that happening is that the current process/thread creates another process/thread, loads an image of the send/recv code into it, and then returns. The new process/thread then completes the operation and sets a flag somewhere, which Wait/Test reports. Am I correct?
There are two main ways that progress can happen, plus some more exotic ones:
1. In a separate thread. This is usually an option in most MPI implementations (usually at configure/compile time). In this version, as you speculated, the MPI implementation has another thread that runs a separate progress engine. That thread manages all of the MPI messages and the sending/receiving of data. This works well if you're not using all of the cores on your machine, as it makes progress in the background without adding overhead to your other MPI calls.
2. Inside other MPI calls. This is the more common way of doing things and is, I believe, the default for most implementations. In this version, non-blocking calls are started when you initiate the call (MPI_I<something>) and are essentially added to an internal queue. Nothing (probably) happens on that call until you later make another MPI call that actually does some blocking communication (or waits for the completion of previous non-blocking calls). When you enter that later MPI call, in addition to doing whatever you asked it to do, it will run the progress engine (the same thing that runs in a thread in version #1). Depending on what that MPI call is supposed to do, the progress engine may run for a while or may just run through once. For instance, if you called MPI_WAIT on an MPI_IRECV, you'll stay inside the progress engine until you receive the message that you're waiting for. If you are just doing an MPI_TEST, it might just cycle through the progress engine once and then jump back out.
3. More exotic methods. As Jeff mentions in his post, there are more exotic methods that depend on the hardware on which you're running. You may have a NIC that will do some magic for you in terms of moving your messages in the background, or some other way to speed up your MPI calls. In general, these are very specific to the implementation and hardware on which you're running, so if you want to know more about them, you'll need to be more specific in your question.
All of this is specific to your implementation, but most of them work in some way similar to this.
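To make the "progress inside other MPI calls" idea concrete, here is a toy sketch in plain Python. It does not use or imitate any real MPI library: `ToyEngine`, `Request`, `istart`, and the `steps` counter (a stand-in for the amount of communication work left) are all invented for illustration. The point is only that starting an operation merely queues it, and that work happens when a later `test` or `wait` call runs the progress engine.

```python
from collections import deque

class Request:
    """Handle for a pending operation; 'remaining' is a made-up unit of work."""
    def __init__(self, steps):
        self.remaining = steps

    @property
    def done(self):
        return self.remaining == 0

class ToyEngine:
    def __init__(self):
        self.pending = deque()
        self.progress_calls = 0

    def istart(self, steps):
        # The "non-blocking call": just enqueue the operation and return a handle.
        req = Request(steps)
        self.pending.append(req)
        return req

    def progress(self):
        # One pass of the progress engine: advance every pending operation a bit.
        self.progress_calls += 1
        for req in list(self.pending):
            if req.remaining > 0:
                req.remaining -= 1
            if req.done:
                self.pending.remove(req)

    def test(self, req):
        # Like a test call: run the engine once, then report the status.
        self.progress()
        return req.done

    def wait(self, req):
        # Like a wait call: stay in the engine until this request completes.
        while not req.done:
            self.progress()

engine = ToyEngine()
req = engine.istart(steps=3)
calls_after_start = engine.progress_calls  # no work has been done yet
first_test = engine.test(req)              # one pass only, so not finished
engine.wait(req)                           # loops until the request is done
```

If the caller never made another call into the engine after `istart`, the request would stay queued forever, which is the toy equivalent of a message sitting on the sender until MPI is progressed.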
Are you asking whether a separate thread for message processing is the only solution for non-blocking operations?
If so, the answer is no. I even think that many setups use a different strategy. Usually, message processing makes progress during every MPI call. I'd recommend having a look at this blog entry by Jeff Squyres.
See the answer by Wesley Bland for a more complete answer.

Is all asynchronous I/O ultimately implemented with polling?

I had thought that asynchronous I/O always takes a callback form. But I recently discovered that some low-level implementations use a polling-style API:
kqueue
libpq
This leads me to think that maybe all (or most) asynchronous I/O (on files, sockets, Mach ports, etc.) is ultimately implemented in some kind of polling manner. Maybe the callback form is just an abstraction for higher-level APIs.
This could be a silly question, but I don't know how most asynchronous I/O is actually implemented at a low level. I had only used system-level notifications, and when I looked at kqueue, which is the system notification mechanism, it turned out to be polling style!
How should I understand asynchronous I/O at a low level? How is a high-level asynchronous notification produced from a low-level polling system (if that is actually what happens)?
At the lowest (or at least, lowest worth looking at) hardware level, asynchronous operations truly are asynchronous in modern operating systems.
For example, when you read a file from the disk, the operating system translates your call to read into a series of disk operations (seek to location, read blocks X through Y, etc.). On most modern OSes, these commands get written either to special registers or to special locations in main memory, and the disk controller is informed that there are operations pending. The operating system then goes on about its business, and when the disk controller has completed all of the operations assigned to it, it triggers an interrupt, causing the thread that requested the read to pick up where it left off.
Regardless of what type of low-level asynchronous operation you're looking at (disk I/O, network I/O, mouse and keyboard input, etc.), ultimately, there is some stage at which a command is dispatched to hardware, and the "callback" as it were is not executed until the hardware reaches out and informs the OS that it's done, usually in the form of an interrupt.
That's not to say that there aren't some asynchronous operations implemented using polling. One trivial (but naive and costly) way to implement any blocking operation asynchronously is just to spawn a thread that waits for the operation to complete (perhaps polling in a tight loop), and then call the callback when it's finished. Generally speaking, though, common asynchronous operations at the OS level are truly asynchronous.
It's also worth mentioning that just because an API is blocking doesn't mean it's polling: you can put a blocking API on an asynchronous operation, and a non-blocking API on a synchronous operation. With things like select and kqueues, for example, the thread actually just goes to sleep until something interesting happens. That "something interesting" comes in the form of an interrupt (usually), and that's taken as an indication that the operating system should wake up the relevant threads to continue work. It doesn't just sit there in a tight loop waiting for something to happen.
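A small Python sketch of the select case, assuming a local socket pair as the event source (the names `reader`, `writer`, and `send_later` are illustrative). The first call uses a zero timeout and so behaves like a single poll; the second has no timeout, so the calling thread sleeps until the socket becomes readable rather than spinning.

```python
import select
import socket
import threading

reader, writer = socket.socketpair()

# timeout=0: check once and return immediately (a single poll). Nothing is ready yet.
ready_before, _, _ = select.select([reader], [], [], 0)

def send_later():
    writer.sendall(b"ping")

# Another thread makes the socket readable a little later.
sender = threading.Timer(0.2, send_later)
sender.start()

# No timeout: the calling thread is put to sleep by the OS and woken only
# once the socket is readable. No CPU is burned in a loop here.
ready_after, _, _ = select.select([reader], [], [])
data = reader.recv(16)

sender.join()
reader.close()
writer.close()
```

The same API thus covers both styles: whether it polls or sleeps depends on the timeout argument, not on its shape.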
There really is no way to tell whether a system uses polling or "real" callbacks (like interrupts) just from its API, but yes, there are asynchronous APIs that are truly backed by asynchronous operations.

What is the standard way to use MPI_Isend to send the same message to many processors?

I was originally using MPI_Send paired with MPI_Irecv, but I recently found out that MPI_Send may block until the message is received. So, I'm changing to MPI_Isend and I need to send the same message to N different processors. Assuming the buffer will get destroyed later, should I have a for loop with MPI_Isend and MPI_Wait in the loop, or should I make an array of requests and have only MPI_Isend in the loop with MPI_Waitall after the loop?
For distributing the same buffer to "n" remote ranks, MPI_Bcast is the "obvious" choice. Unless you have some "overwhelming" reason to avoid MPI_Bcast, it would be advisable to use it. In general, MPI_Bcast is very well optimized by all the major MPI implementations.
If blocking is an issue, the MPI 3.0 standard introduced MPI_Ibcast along with other non-blocking collectives. The initial implementations of non-blocking collectives appear to be "naive" and built as wrappers around non-blocking point-to-point routines (e.g. MPI_Ibcast is implemented as a wrapper around calls to MPI_Isend and MPI_Irecv). The implementations are likely to improve in quality over the next year or two, depending partly on the speed of adoption by the MPI application developer community.
MPI_Send will "block" until the send buffer can be safely re-used by the calling application. Nothing is guaranteed about the state of the corresponding MPI_[I]Recv's.
If you need non-blocking, then the best advice would be to call MPI_Isend in a loop. Alternatively, persistent communication requests could be used with MPI_Start or MPI_Startall if this is a message pattern that will be repeated over the course of the program.
Since it's the same message, you should be able to use MPI_Bcast. You'll just have to create a new communicator to define a subgroup of processes.

How can a LuaSocket server handle several requests simultaneously?

The problem is the inability of my Lua server to accept multiple requests simultaneously.
I attempted to have each client message processed in its own coroutine, but this seems to have failed.
while true do
    local client = server:accept()
    coroutine.resume(coroutine.create(function()
        GiveMessage(client)
    end))
end
This code seems to not actually accept more than one client message at the same time. What is wrong with this method? Thank you for helping.
You will not be able to get truly simultaneous handling with coroutines alone: coroutines are for cooperative multitasking. Only one coroutine runs at a time.
The code that you've written is no different from calling GiveMessage() in a loop directly. You need to write a coroutine dispatcher and find a sensible reason to yield from GiveMessage() for that approach to work.
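To illustrate what a dispatcher plus a yield point buys you, here is a rough sketch using Python generators rather than Lua coroutines (the mechanics are analogous, but this is not LuaSocket code). `handle_client`, `run_round_robin`, and the `chunks` lists are hypothetical; the `yield` marks the place where a real handler would otherwise wait for I/O.

```python
from collections import deque

def handle_client(name, chunks, log):
    """Process one client's messages, yielding between them."""
    for chunk in chunks:
        log.append((name, chunk))
        # Give control back to the dispatcher. In a real server this is where
        # the handler would yield instead of blocking on the socket.
        yield

def run_round_robin(tasks):
    """Minimal dispatcher: resume each task in turn until all have finished."""
    queue = deque(tasks)
    while queue:
        task = queue.popleft()
        try:
            next(task)          # run the task up to its next yield
        except StopIteration:
            continue            # task finished; do not requeue it
        queue.append(task)

log = []
run_round_robin([
    handle_client("A", ["a1", "a2"], log),
    handle_client("B", ["b1", "b2"], log),
])
```

The two clients are interleaved even though only one handler runs at any moment. Without the `yield`, each handler would run to completion before the next one started, which is what the loop in the question does.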
There are at least three solutions, depending on the specifics of your task:
Spawn several forks of your server and handle operations in coroutines in each fork. Control the coroutines with Copas, with lua-ev, or with a home-grown dispatcher; there is nothing wrong with that. I recommend this way.
Use Lua states instead of coroutines: keep a pool of states, a pool of worker OS threads, and a queue of tasks. Execute each task in a free Lua state on a free worker thread. This requires some low-level coding and is messier.
Look for existing, more specialized solutions. There are several, but to advise on that I would need to know more about the kind of server you're writing.
Whatever you choose, avoid using a single Lua state from several threads at the same time. (It is possible, with the right amount of coding, but a bad idea.)
AFAIK coroutines don't play nice with luaSocket out-of-the-box. But there is Copas you can use.

Resources