Julia: understanding when task switching happens - asynchronous

I could not find detailed documentation about the @async macro. From the docs about parallelism I understand that there is only one system thread used inside a Julia process, and that explicit task switching happens with the help of the yieldto function. Correct me if I am wrong about this.
For me it is difficult to tell just by looking at the code when exactly these task switches happen, and knowing when they happen seems crucial.
As I understand it, a yieldto somewhere in the code (or in some function called by the code) needs to be there to ensure that the system does not get stuck in a single task.
For example, when there is a read operation, inside the read there is probably a wait call, and in the implementation of wait there is probably a yieldto call. I thought that without the yieldto call the code would get stuck in one task; however, running the following example seems to prove this hypothesis wrong.
@async begin # Task A
    while true
        println("A")
    end
end

while true # Task B
    println("B")
end
This code produces the following output
BA
BA
BA
...
It is very unclear to me where the task switching happens inside the task created by the @async macro in the code above.
How can I tell, just by looking at some code, at which points task switching can happen?

The task switch happens inside the call to println("A"), which at some point calls write(STDOUT, "A".data). Because isa(STDOUT, Base.AsyncStream) holds and there is no more specialized method, this resolves to:
write{T}(s::AsyncStream,a::Array{T}) at stream.jl:782
If you look at this method, you will notice that it calls stream_wait(ct) on the current task ct, which in turn calls wait().
(Also note that println is not atomic, because there is a potential wait between writing the arguments and the newline.)
You could of course determine when stuff like that happens by looking at all the code involved. But I don't see why you would need to know this exactly: when working with parallelism, you should not depend on tasks not switching context anyway. If you depend on a certain execution order, synchronize explicitly.
(You already kind of noted this in your question, but let me restate it here: As a rule of thumb, when using green threads, you can expect potential context switches when doing IO, because blocking for IO is a textbook example of why green threads are useful in the first place.)

Related

Synchronous and Asynchronous code confusing definitions

I might have misunderstood or be missing something, but I have to clear this up:
Asynchronous code means code that gets executed with multiple operations at the same time, without blocking.
Synchronous code means code that gets executed one operation at a time.
But the definition of the word synchronous is "occurring at the same time", so isn't it the other way around? Why the confusing naming? Is the definition referring to something I am not aware of?

Why should nesting of QEventLoops be avoided?

In his talk "Qt event loop, networking and I/O API", Thiago Macieira mentions that nesting of QEventLoops should be avoided:
QEventLoop is for nesting event Loops... Avoid it if you can because it creates a number of problems: things might reenter, new activations of sockets or timers that you were not expecting.
Can anybody expand on what he is referring to? I maintain a lot of code that uses modal dialogs which internally nest a new event loop when exec() is called so I'm very interested in knowing what kind of problems this may lead to.
A nested event loop costs you 1-2 KB of stack. That is about 5% of the L1 data cache on a typical CPU with a 32 KB L1 data cache, give or take.
It has the capacity to reenter any code already on the call stack. There are no guarantees that any of that code was designed to be reentrant. I'm talking about your code, not Qt's code. It can reenter code that has started this event loop, and unless you explicitly control this recursion, there are no guarantees that you won't eventually run out of stack space.
In current Qt, there are two places where, due to long-standing API bugs or platform inadequacies, you have to use a nested exec: QDrag and platform file dialogs (on some platforms). You simply don't need it anywhere else. You do not need a nested event loop for non-platform modal dialogs.
Reentering the event loop is usually caused by writing pseudo-synchronous code, where one laments the supposed lack of yield() (co_yield and co_await have landed in C++ now!), hides one's head in the sand, and uses exec() instead. Such code typically ends up as barely palatable spaghetti and is unnecessary.
For modern C++, using the C++20 coroutines is worthwhile; there are some Qt-based experiments around, easy to build on.
There are Qt-native implementations of stackful coroutines: Skycoder42/QtCoroutings, a recent project, and the older ckamm/qt-coroutine. I'm not sure how fresh the latter code is; it looks like it all worked at some point.
Writing asynchronous code cleanly without coroutines is usually accomplished through state machines; see this answer for an example, and the QP framework for an implementation different from QStateMachine.
Personal anecdote: I couldn't wait for C++ coroutines to become production-ready, so I now write asynchronous communication code in Go and statically link it into the Qt application. It works great, the garbage collector is unnoticeable, and the code is way easier to read and write than C++ with coroutines. I had a lot of code written using the C++ coroutines TS, but I moved it all to Go and I don't regret it.
A nested event loop will lead to ordering inversion (at least on Qt 4).
Let's say you have the following sequence of events:
enqueued in outer loop: 1,2,3
processing 1 => spawn inner loop
enqueue 4 in inner loop
processing 4
exit inner loop
processing 2
So the processing order was: 1, 4, 2, 3.
I speak from experience; this usually resulted in a crash in my code.

Tornado and concurrent.futures.Executor

I'm learning about async and Tornado and struggling. First off, is it possible to use the concurrent.futures Executor classes in Tornado?
In the example below I'm creating a websocket, and when receiving a message I want to run check() as another process in the background. This is a contrived example just for my learning's sake. Neither INSIDE nor AFTER gets printed. Why do we need async-specific packages like Motor if we have this Executor class?
Also, in all the examples of Tornado I've seen, @gen.coroutine is always used in classes that extend tornado.web.RequestHandler; in my example I'm using a tornado.websocket.WebSocketHandler. Can @gen.coroutine be used inside this class as well?
Finally, can anyone recommend a book or in-depth tutorial on this subject? I bought "Introduction to Tornado", but it's a little outdated because it uses tornado.gen.engine.
import time
import concurrent.futures

import tornado.web
import tornado.websocket

def check(msg):
    time.sleep(10)
    return msg

class SessionHandler(tornado.websocket.WebSocketHandler):
    def open(self):
        pass

    def on_close(self):
        pass

    # not sure if I needed this decorator or not?
    @tornado.web.asynchronous
    def on_message(self, message):
        print("INSIDE")
        with concurrent.futures.ProcessPoolExecutor() as executor:
            f = executor.submit(check, "a")
            result = yield f
        print("AFTER")
In order to use yield, you must use @tornado.gen.coroutine (or @gen.engine). @tornado.web.asynchronous is unrelated to the use of yield and is generally only used with callback-based handlers (@asynchronous only works in regular handlers, not websockets). Change the decorator and you should see your print statements run.
Why would someone write an asynchronous library like Motor instead of using an executor like this? For performance. A thread or process pool is much more expensive (mainly in terms of memory) than doing the same thing asynchronously. There's nothing wrong with using an executor when you need a library that has no asynchronous counterpart, but it's better to use an asynchronous version if it's available (and if the performance matters enough, to write an asynchronous version when you need one).
Also note that ProcessPoolExecutor can be tricky: all arguments submitted to the executor must be picklable, and this can be expensive for large objects. I would recommend a ThreadPoolExecutor in most cases where you just want to wrap a synchronous library.

How to implement a callback/producer/consumers scheme?

I'm warming up with Clojure and started to write a few simple functions.
I'm realizing how the language is clearly well-suited for parallel computation and this got me thinking. I've got an app (written in Java but whatever) that works the following way:
one thread waits for input to come in (filesystem in my case but it could be network or whatever) and puts that input once it arrives on a queue
several consumers fetch data from that queue and process the data in parallel
The code that puts the input to be processed in parallel may look like this (it's just an example):
asynchFetchInput(new MyCallBack() {
    public void handle(Input input) {
        queue.put(input);
    }
});
Where asynchFetchInput would spawn a Thread and then call the callback.
It's really just an example but if someone could explain how to do something similar using Clojure it would greatly help me understand the "bigger picture".
If you have to transform data, you can make it into a seq and then feed it to either map or pmap; the latter will process it in parallel. filter and reduce are also both really useful, so you might want to see if you can express your logic in those terms.
You might also want to look into the concurrency utilities in basic Java rather than spawning your own threads.

Multitasking using setjmp, longjmp

Is there a way to implement multitasking using the setjmp and longjmp functions?
You can indeed. There are a couple of ways to accomplish it. The difficult part is initially getting the jmp_bufs to point to other stacks. longjmp is only defined for jmp_buf arguments which were created by setjmp, so there's no way to do this without either using assembly or exploiting undefined behavior. User-level threads are inherently not portable, so portability isn't a strong argument against doing it, really.
Step 1
You need a place to store the contexts of different threads, so make a queue of jmp_buf structures for however many threads you want.
Step 2
You need to malloc a stack for each of these threads.
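Steps 1 and 2 need very little code. Here is a minimal sketch; the run queue, ctx_enqueue/ctx_dequeue, alloc_stack, and the sizes are all names and numbers made up for illustration, not part of any standard API:

#include <setjmp.h>
#include <stdlib.h>

#define MAX_THREADS 8
#define STACK_SIZE (64 * 1024)

/* Step 1: a tiny FIFO of saved contexts (the run queue) */
static jmp_buf *runq[MAX_THREADS];
static int rq_head, rq_tail;

void ctx_enqueue(jmp_buf *ctx)
{
    runq[rq_tail++ % MAX_THREADS] = ctx;
}

jmp_buf *ctx_dequeue(void)
{
    return (rq_head == rq_tail) ? NULL : runq[rq_head++ % MAX_THREADS];
}

/* Step 2: one private stack per thread; stacks grow down, so the
   address you will care about later is the HIGH end of the region */
void *alloc_stack(void **top)
{
    char *base = malloc(STACK_SIZE);
    *top = base + STACK_SIZE;
    return base;
}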
Step 3
You need to get some jmp_buf contexts which have stack pointers in the memory locations you just allocated. You could inspect the jmp_buf structure on your machine and find out where it stores the stack pointer. Call setjmp and then modify its contents so that the stack pointer is in one of your allocated stacks. Stacks usually grow down, so you probably want your stack pointer somewhere near the highest memory location. If you write a basic C program, use a debugger to disassemble it, and then find the instructions it executes when you return from a function, you can work out what the offset ought to be. For example, with System V calling conventions on x86, you'll see that it pops %ebp (the frame pointer) and then calls ret, which pops the return address off the stack. So on entry into a function, it pushes the return address and frame pointer. Each push moves the stack pointer down by 4 bytes, so you want the stack pointer to start at the high address of the allocated region, minus 8 bytes (as if you had just called a function to get there). We will fill those 8 bytes next.
The other thing you can do is write some very small (one-line) inline assembly to manipulate the stack pointer and then call setjmp. This is actually more portable, because on many systems the pointers in a jmp_buf are mangled for security, so you can't easily modify them.
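On x86-64 with GCC, that inline-assembly variant might look like the sketch below. To be clear, this is my illustration rather than code from the answer: the register dance is specific to the System V ABI, stack_top is assumed to be 16-byte aligned, thread_body and thread_exit are hypothetical names, and the whole trick is undefined behavior as far as ISO C is concerned:

#include <setjmp.h>

jmp_buf child_ctx;         /* the context being manufactured */

void thread_body(void);    /* hypothetical thread entry point */
void thread_exit(void);    /* the escape hatch built in Step 4 below */

/* Runs briefly on the freshly malloc'd stack: capture a context
   there, then return so the caller can restore the real stack. */
void capture(void)
{
    if (setjmp(child_ctx) == 0)
        return;            /* first pass: just record the context */
    thread_body();         /* a later longjmp lands here, on the new stack */
    thread_exit();         /* never fall off the end of a fake stack */
}

void capture_on_stack(void *stack_top)  /* high end, 16-byte aligned */
{
    asm volatile(
        "mov %%rsp, %%rbx\n\t"  /* stash the current stack pointer  */
        "mov %0, %%rsp\n\t"     /* point rsp at the malloc'd stack  */
        "call capture\n\t"      /* setjmp() records that rsp        */
        "mov %%rbx, %%rsp\n\t"  /* restore the original stack       */
        :
        : "r"(stack_top)
        : "rbx", "rax", "rcx", "rdx", "rsi", "rdi",
          "r8", "r9", "r10", "r11", "cc", "memory");
}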
I haven't tried it, but you might be able to avoid the asm by just deliberately overflowing the stack by declaring a very large array and thus moving the stack pointer.
Step 4
You need exiting threads to return the system to some safe state. If you don't do this, and one of the threads returns, it will take the address right above your allocated stack as a return address and jump to some garbage location and likely segfault. So first you need a safe place to return to. Get this by calling setjmp in the main thread and storing the jmpbuf in a globally accessible location. Define a function which takes no arguments and just calls longjmp with the saved global jmpbuf. Get the address of that function and copy it to your allocated stacks where you left room for the return address. You can leave the frame pointer empty. Now, when a thread returns, it will go to that function which calls longjmp, and jump right back into the main thread where you called setjmp, every time.
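In code, that escape hatch is tiny. A sketch, with scheduler_ctx and thread_exit again being names I made up for this illustration:

#include <setjmp.h>

jmp_buf scheduler_ctx;   /* filled in by setjmp() in the main thread */

/* The fake return-address slot at the top of every allocated stack
   holds &thread_exit, so a returning thread lands here instead of
   jumping to garbage. */
void thread_exit(void)
{
    longjmp(scheduler_ctx, 1);
}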
Step 5
Right after the main thread's setjmp, you want to have some code that determines which thread to jump to next, pulling the appropriate jmpbuf off the queue and calling longjmp to go there. When there are no threads left in that queue, the program is done.
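Combined with the hypothetical run queue from Step 1, the main thread's dispatch point could look like this sketch:

#include <setjmp.h>
#include <stddef.h>

extern jmp_buf scheduler_ctx;       /* from Step 4 */
extern jmp_buf *ctx_dequeue(void);  /* from Step 1 */

int main(void)
{
    /* ... allocate stacks, manufacture contexts, enqueue them ... */

    setjmp(scheduler_ctx);     /* finished threads longjmp back here */

    jmp_buf *next = ctx_dequeue();
    if (next != NULL)
        longjmp(*next, 1);     /* never returns; the chosen thread runs */

    return 0;                  /* queue empty: all threads are done */
}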
Step 6
Write a context switch function which calls setjmp and stores the current state back on the queue, and then longjmp on another jmpbuf from the queue.
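The switch itself is just a setjmp/longjmp pair. A sketch, reusing the made-up queue API from Step 1:

#include <setjmp.h>

void ctx_enqueue(jmp_buf *ctx);    /* hypothetical, from Step 1 */
jmp_buf *ctx_dequeue(void);

/* Cooperative yield: park our own context on the run queue, then
   resume whichever thread is next in line. */
void ctx_switch(void)
{
    jmp_buf me;                /* lives on this thread's own stack */
    if (setjmp(me) == 0) {
        ctx_enqueue(&me);
        longjmp(*ctx_dequeue(), 1);   /* switch away */
    }
    /* setjmp returned nonzero: another thread longjmp'd back to us */
}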
Conclusion
That's the basics. As long as threads keep calling context switch, the queue keeps getting repopulated, and different threads run. When a thread returns, if there are any left to run, one is chosen by the main thread, and if none are left, the process terminates. With relatively little code you can have a pretty basic cooperative multitasking setup. There are more things you probably want to do, like implement a cleanup function to free the stack of a dead thread, etc. You can also implement preemption using signals, but that is much more difficult because setjmp doesn't save the floating point register state or the flags registers, which are necessary when the program is interrupted asynchronously.
It may be bending the rules a little, but GNU pth does this. It's possible, but you probably shouldn't try it yourself except as an academic proof-of-concept exercise; use the pth implementation if you want to do it seriously and in a remotely portable fashion. You'll understand why when you read the pth thread-creation code.
(Essentially it uses a signal handler to trick the OS into creating a fresh stack, then longjmp's out of there and keeps the stack around. It works, evidently, but it's sketchy as hell.)
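For the curious, the heart of that trick looks roughly like the sketch below. This is my heavy simplification, not pth's actual code: real pth also juggles signal masks with sigsetjmp/siglongjmp, copies the captured context out of the globals so they can be reused, and restores the previous signal disposition:

#include <setjmp.h>
#include <signal.h>
#include <stddef.h>

static jmp_buf boot_ctx;          /* context captured on the new stack */
static void (*boot_entry)(void);  /* thread function to run there */

/* Runs on the alternate stack: capture a context there, then return,
   leaving the now-dormant stack behind for later reuse. */
static void trampoline(int sig)
{
    (void)sig;
    if (setjmp(boot_ctx) == 0)
        return;         /* back to the kernel; the stack stays ours */
    boot_entry();       /* a later longjmp lands here: run the thread */
}

static void spawn(void (*entry)(void), void *stack, size_t size)
{
    stack_t ss = { .ss_sp = stack, .ss_size = size, .ss_flags = 0 };
    struct sigaction sa = { 0 };

    sa.sa_handler = trampoline;
    sa.sa_flags = SA_ONSTACK;       /* deliver the signal on `stack` */
    sigemptyset(&sa.sa_mask);

    sigaltstack(&ss, NULL);
    sigaction(SIGUSR1, &sa, NULL);
    boot_entry = entry;
    raise(SIGUSR1);                 /* trampoline setjmp()s on the new stack */
    /* later: longjmp(boot_ctx, 1) switches into the new thread */
}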
In production code, if your OS supports makecontext/swapcontext, use those instead. If it supports CreateFiber/SwitchToFiber, use those instead. And be aware of the disappointing truth that one of the most compelling uses of coroutines -- that is, inverting control by yielding out of event handlers called by foreign code -- is unsafe, because the calling module has to be reentrant and you generally can't prove that. This is why fibers still aren't supported in .NET...
This is a form of what is known as userspace context switching.
It's possible but error-prone, especially if you use the default implementation of setjmp and longjmp. One problem with these functions is that on many operating systems they only save a subset of the 64-bit registers rather than the entire context. This is often not enough, e.g. when dealing with system libraries (my experience here is with a custom implementation for amd64/windows, which worked pretty stably, all things considered).
That said, if you're not trying to work with complex external codebases or event handlers, if you know what you're doing, and (especially) if you write your own version in assembler that saves more of the current context (if you're using 32-bit Windows or Linux this might not be necessary; if you use some versions of BSD I imagine it almost definitely is), and if you debug it paying careful attention to the disassembly output, then you may be able to achieve what you want.
I did something like this for my studies:
https://github.com/Kraego/STM32L476_MiniOS/blob/main/Usercode/Concurrency/scheduler.c
The context/thread switching is done by setjmp/longjmp. The difficult part was getting the allocated stack correct (see allocateStack()); this depends on your platform.
This is just a demonstration of how this could work; I would never use it in production.
As was already mentioned by Sean Ogden, longjmp() is not good for multitasking, as it can only move the stack upward and can't jump between different stacks. No go with that.
As mentioned by user414736, you can use the getcontext/makecontext/swapcontext functions, but the problem with those is that they are not fully in user space. They actually call the sigprocmask() syscall because they switch the signal mask as part of the context switching. This makes swapcontext() much slower than longjmp(), and you likely don't want slow coroutines.
To my knowledge there is no POSIX-standard solution to this problem, so I compiled my own from the different available sources. You can find the context-manipulating functions extracted from libtask here:
https://github.com/dosemu2/dosemu2/tree/devel/src/base/lib/mcontext
The functions are getmcontext(), setmcontext(), makemcontext() and swapmcontext(). They have semantics similar to those of the standard functions with similar names, but they also mimic the setjmp() semantics in that getmcontext() returns 1 (instead of 0) when jumped to by setmcontext().
On top of that you can use a port of libpcl, the coroutine library:
https://github.com/dosemu2/dosemu2/tree/devel/src/base/lib/libpcl
With this, it is possible to implement fast cooperative user-space threading. It works on Linux, on the i386 and x86_64 arches.
