Suggestions for doing async I/O with Task Parallel Library - asynchronous

I have some high performance file transfer code which I wrote in C# using the Async Programming Model (APM) idiom (eg, BeginRead/EndRead). This code reads a file from a local disk and writes it to a socket.
For best performance on modern hardware, it's important to keep more than one outstanding I/O operation in flight whenever possible. Thus, I post several BeginRead operations on the file, then when one completes, I call a BeginSend on the socket, and when that completes I do another BeginRead on the file. The details are a bit more complicated than that but at the high level that's the idea.
I've got the APM-based code working, but it's very hard to follow and probably has subtle concurrency bugs. I'd love to use TPL for this instead. I figured Task.Factory.FromAsync would just about do it, but there's a catch.
All of the I/O samples I've seen (most particularly the StreamExtensions class in the Parallel Extensions Extras) assume one read followed by one write. This won't perform the way I need.
I can't use something simple like Parallel.ForEach or the Extras extension Task.Factory.Iterate because the async I/O tasks don't spend much time on a worker thread, so Parallel just starts up another task, resulting in potentially dozens or hundreds of pending I/O operations; way too much! You can work around that by Waiting on your tasks, but that causes creation of an event handle (a kernel object), and a blocking wait on a task wait handle, which ties up a worker thread. My APM-based implementation avoids both of those things.
I've been playing around with different ways to keep multiple read/write operations in flight, and I've managed to do so using continuations that call a method that creates another task, but it feels awkward, and definitely doesn't feel like idiomatic TPL.
Has anyone else grappled with an issue like this with the TPL? Any suggestions?

If you're worried about too many threads, you can just set ParallelOptions.MaxDegreeOfParallelism to an acceptable number in your call to Parallel.ForEach.

Related

.NET/C# - What is the best option instead of an ActionBlock<T> (or Channel<T>) for speed?

corefxlab has something called a Channel which is a really nice implementation of an async P-C queue and definitely does what I'm looking for. I'm curious if there's an implementation that ultimately had a similar API to ActionBlock<T>:
Must be able to accept/deny from multiple producers.
Only needs to have one consuming task but would be preferable that it continue processing until empty. Then 'wait' for new items.
A Channel<T> is much faster than an BufferBlock<T> but I'm just curious if given the specific requirements if there was something even faster.
According to a readme by Stephen Toub, Channels might end up being the underlying implementation around some Dataflow blocks. Channels wins for P-C queue async speed.

How to use non-blocking point-to-point MPI routines instead of collectives

In my programm, I would like to heavily parallelize many mathematical calculations, the results of which are then written to an output file.
I successfully implemented that using collective communication (gather, scatter etc.) but I noticed that using these synchronizing routines, the slowest among all processors dominates the execution time and heavily reduces overall computation time, as fast processors spend a lot of time waiting.
So I decided to switch to the scheme, where one (master) processor is dedicated to receiving chunks of results and handling the file output, and alle the other processors calculate these results and send them to the master using non-blocking send routines.
Unfortunately, I don't really know how to implement the master code; Do I need to run an infinite loop with MPI_Recv(), listening for incoming messages? How do I know when to stop the loop? Can I combine MPI_Isend() and MPI_Recv(), or do both method need to be non-blocking? How is this typically done?
MPI 3.1 provides non-blocking collectives. I would strongly recommend that instead of implementing it on your own.
However, it may not help you after all. Eventually you need the data from all processes, even the slow ones. So you are likely to wait at some point again. Non-blocking communication overlaps communication and computation, but it doesn't fix your load imbalances.
Update (more or less a long clarification comment)
There are several layers to your question, I might have been confused by the title as to what kind of answer you were expecting. Maybe the question is rather
How do I implement a centralized work queue in MPI?
This pops up regularly, most recently here. But that is actually often undesirable because a central component quickly becomes a bottleneck in large scale programs. So the actual problem you have, is that your work decomposition & mapping is imbalanced. So the more fundamental "X-question" is
How do I load balance an MPI application?
At that point you must provide more information about your mathematical problem and it's current implementation. Preferably in form of an [mcve]. Again, there is no standard solution. Load balancing is a huge research area. It may even be a topic for CS.SE rather than SO.

Why should nesting of QEventLoops be avoided?

In his Qt event loop, networking and I/O API talk, Thiago Macieira mentions that nesting of QEventLoop's should be avoided:
QEventLoop is for nesting event Loops... Avoid it if you can because it creates a number of problems: things might reenter, new activations of sockets or timers that you were not expecting.
Can anybody expand on what he is referring to? I maintain a lot of code that uses modal dialogs which internally nest a new event loop when exec() is called so I'm very interested in knowing what kind of problems this may lead to.
A nested event loop costs you 1-2kb of stack. It takes up 5% of the L1 data cache on typical 32kb L1 cache CPUs, give-or-take.
It has the capacity to reenter any code already on the call stack. There are no guarantees that any of that code was designed to be reentrant. I'm talking about your code, not Qt's code. It can reenter code that has started this event loop, and unless you explicitly control this recursion, there are no guarantees that you won't eventually run out of stack space.
In current Qt, there are two places where, due to a long standing API bugs or platform inadequacies, you have to use nested exec: QDrag and platform file dialogs (on some platforms). You simply don't need to use it anywhere else. You do not need a nested event loop for non-platform modal dialogs.
Reentering the event loop is usually caused by writing pseudo-synchronous code where one laments the supposed lack of yield() (co_yield and co_await has landed in C++ now!), hides one's head in the sand and uses exec() instead. Such code typically ends up being barely palatable spaghetti and is unnecessary.
For modern C++, using the C++20 coroutines is worthwhile; there are some Qt-based experiments around, easy to build on.
There are Qt-native implementations of stackful coroutines: Skycoder42/QtCoroutings - a recent project, and the older ckamm/qt-coroutine. I'm not sure how fresh the latter code is. It looks that it all worked at some point.
Writing asynchronous code cleanly without coroutines is usually accomplished through state machines, see this answer for an example, and QP framework for an implementation different from QStateMachine.
Personal anecdote: I couldn't wait for C++ coroutines to become production-ready, and I now write asynchronous communication code in golang, and statically link that into a Qt application. Works great, the garbage collector is unnoticeable, and the code is way easier to read and write than C++ with coroutines. I had a lot of code written using C++ coroutines TS, but moved it all to golang and I don't regret it.
A nested event loop will lead to ordering inversion. (at least on qt4)
Lets say you have the following sequence of things happening
enqueued in outer loop: 1,2,3
processing 1 => spawn inner loop
enqueue 4 in inner loop
processing 4
exit inner loop
processing 2
So you see the processing order was: 1,4,2,3.
I speak from experience and this usually resulted in a crash in my code.

Parallel VS asynchronous programming

While I have tried to dive into both techniques it is still a bit blurry to me for which problems and situations these are used.
If I simplify this, are CPU-bound problems handled with parallel and IO-bound ones async programming?
Perhaps a better title for this question would be 'to block or not to block?' as going parallel or asynchronous are not mutually exclusive.
I recommend using multiple threads on a problem either 1) when it is both CPU bound, and can be split up into multiple parts that do not require coordination/sharing to complete or 2) the job may stall for a long period of time on IO and we do not want to prevent other work from occurring.
Asynchronous basically means, don't block a thread waiting for something to complete. Instead rely on a callback that will notify of its completion. As such one can go asynchronous when there is only one worker thread.
Asynchronous techniques have been resurfacing recently because they scale better than blocking techniques. This is because we are limited in how many threads we can have on a single system before the overheads of managing those threads dominate.

Why shouldn't I use F# asynchronous workflows for parallelism?

I have been learning F# recently, being particularly interested in its ease of exploiting data parallelism. The data |> Array.map |> Async.Parallel |> Async.RunSynchronously idiom seems very easy to understand and straightforward to use and get real value from.
So why is it that async is not really intended for this? Donald Syme himself says that PLINQ and Futures are probably a better choice. And other answers I've read here agree with that as well as recommending TPL. (PLINQ doesn't seem too much different to the above built-in functions, as long as you're using the F# Powerpack to get the PSeq functions.)
F# and functional languages make a lot of sense for this, and some applications have achieved great success with async parallelism.
So why shouldn't I use async to execute parallel data processes? What am I going to lose by writing parallel async code instead of using PLINQ or TPL?
So why shouldn't I use async to execute parallel data processes?
If you have a tiny number of completely independent non-async tasks and lots of cores then there is nothing wrong with using async to achieve parallelism. However, if your tasks are dependent in any way or you have more tasks than cores or you push the use of async too far into the code then you will be leaving a lot of performance on the table and could do a lot better by choosing a more appropriate foundation for parallel programming.
Note that your example can be written even more elegantly using the TPL from F# though:
Array.Parallel.map f xs
What am I going to lose by writing parallel async code instead of using PLINQ or TPL?
You lose the ability to write cache oblivious code and, consequently, will suffer from lots of cache misses and, therefore, all cores stalling waiting for shared memory which means poor scalability on a multicore.
The TPL is built upon the idea that child tasks should execute on the same core as their parent with a high probability and, therefore, will benefit from reusing the same data because it will be hot in the local CPU cache. There is no such assurance with async.
I wrote an article that re-implements one C# TPL sample using both Task and Async, which also has some comments on the difference between the two. You can find it here and there is also a more advanced async-based version.
Here is a quote from the first article that compares the two options:
The choice between the two possible implementations depends on many factors. Asynchronous workflows were designed specifically for F#, so they more naturally fit with the language. They offer better performance for I/O bound tasks and provide more convenient exception handling. Moreover, the sequential syntax is quite convenient. On the other hand, tasks are optimized for CPU bound calculations and make it easier to access the result of calculation from other places of the application without explicit caching.
I always figured it's what TPL, PLinq etc... give you over and above what Async does. (Cancellation mechanisms is the one that comes to mind.) This question has some better answers.
This article hints at a slight performance advantage to TPL, but probably not enough to be significant.

Resources