Multitasking using setjmp, longjmp - setjmp

Is there a way to implement multitasking using the setjmp and longjmp functions?

You can indeed. There are a couple of ways to accomplish it. The difficult part is initially getting the jmpbufs which point to other stacks. Longjmp is only defined for jmpbuf arguments which were created by setjmp, so there's no way to do this without either using assembly or exploiting undefined behavior. User level threads are inherently not portable, so portability isn't a strong argument for not doing it really.
Step 1
You need a place to store the contexts of different threads, so make a queue of jmpbuf structures for however many threads you want.
Step 2
You need to malloc a stack for each of these threads.
Step 3
You need to get some jmpbuf contexts which have stack pointers in the memory locations you just allocated. You could inspect the jmpbuf structure on your machine and find out where it stores the stack pointer, then call setjmp and modify its contents so that the stack pointer is in one of your allocated stacks. Stacks usually grow down, so you probably want your stack pointer somewhere near the highest memory location. If you write a basic C program, disassemble it in a debugger, and find the instructions it executes when you return from a function, you can find out what the offset ought to be. For example, with System V calling conventions on x86, you'll see that it pops %ebp (the frame pointer) and then executes ret, which pops the return address off the stack. So on entry into a function, the return address and the frame pointer are pushed. Each push moves the stack pointer down by 4 bytes, so you want the stack pointer to start 8 bytes below the high address of the allocated region (as if you had just called a function to get there). We will fill the 8 bytes next.
The other thing you can do is write some very small (one line) inline assembly to manipulate the stack pointer, and then call setjmp. This is actually more portable, because in many systems the pointers in a jmpbuf are mangled for security, so you can't easily modify them.
I haven't tried it, but you might be able to avoid the asm by just deliberately overflowing the stack by declaring a very large array and thus moving the stack pointer.
Step 4
You need exiting threads to return the system to some safe state. If you don't do this, and one of the threads returns, it will take the address right above your allocated stack as a return address and jump to some garbage location and likely segfault. So first you need a safe place to return to. Get this by calling setjmp in the main thread and storing the jmpbuf in a globally accessible location. Define a function which takes no arguments and just calls longjmp with the saved global jmpbuf. Get the address of that function and copy it to your allocated stacks where you left room for the return address. You can leave the frame pointer empty. Now, when a thread returns, it will go to that function which calls longjmp, and jump right back into the main thread where you called setjmp, every time.
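As a rough illustration of Steps 2 to 4, here is a minimal sketch of allocating a stack and planting the return address in it. The names task_stack, task_stack_init and thread_exit_trampoline are made up for this example, the slots are pointer-sized rather than the 4 bytes of 32-bit x86, and the 16-byte rounding is an extra assumption. It only prepares memory; it does not switch stacks.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical bookkeeping for one thread's stack. */
struct task_stack {
    unsigned char *base;       /* start of the malloc'd region (lowest address) */
    size_t         size;       /* size of the region in bytes */
    uintptr_t     *initial_sp; /* where the stack pointer should start */
};

/* Stand-in for the Step 4 function; the real one would longjmp to the
   main thread's saved jmpbuf. */
static void thread_exit_trampoline(void)
{
}

/* Allocate a stack and lay out the two slots a function return expects. */
int task_stack_init(struct task_stack *s, size_t size)
{
    if (size < 64)
        return -1;
    s->base = malloc(size);
    if (s->base == NULL)
        return -1;
    s->size = size;

    /* Stacks usually grow down, so start from the high end. Rounding down
       to 16 bytes is an assumption added for this sketch. */
    uintptr_t top = ((uintptr_t)s->base + size) & ~(uintptr_t)15;

    /* Reserve two pointer-sized slots, as if a function had just been called. */
    uintptr_t *slots = (uintptr_t *)top - 2;
    slots[0] = 0;                                 /* saved frame pointer: left empty */
    slots[1] = (uintptr_t)thread_exit_trampoline; /* return address */

    s->initial_sp = slots;
    return 0;
}
```

Getting initial_sp into a jmpbuf (or into the stack pointer register) is the platform-specific part described in Step 3 and is deliberately left out here.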
Step 5
Right after the main thread's setjmp, you want to have some code that determines which thread to jump to next, pulling the appropriate jmpbuf off the queue and calling longjmp to go there. When there are no threads left in that queue, the program is done.
Step 6
Write a context switch function which calls setjmp and stores the current state back on the queue, and then longjmp on another jmpbuf from the queue.
Conclusion
That's the basics. As long as threads keep calling context switch, the queue keeps getting repopulated, and different threads run. When a thread returns, if there are any left to run, one is chosen by the main thread, and if none are left, the process terminates. With relatively little code you can have a pretty basic cooperative multitasking setup. There are more things you probably want to do, like implement a cleanup function to free the stack of a dead thread, etc. You can also implement preemption using signals, but that is much more difficult because setjmp doesn't save the floating point register state or the flags registers, which are necessary when the program is interrupted asynchronously.
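The main-thread side of this (Steps 4 and 5) can be sketched in fully defined C if the threads simply run on the main stack and never yield. This is only a sketch of the dispatch loop, not of stack switching, and the names main_ctx, thread_exit and run_threads are made up for the example.

```c
#include <setjmp.h>
#include <stddef.h>

typedef void (*thread_fn)(void);

static jmp_buf main_ctx; /* Step 4: the globally accessible safe place */
static int exited;       /* how many threads have come back through it */

/* What an exiting thread ends up in: jump back to the main thread. */
static void thread_exit(void)
{
    exited++;
    longjmp(main_ctx, 1);
}

static void thread_a(void) { thread_exit(); }
static void thread_b(void) { thread_exit(); }

int run_threads(void)
{
    static const thread_fn queue[] = { thread_a, thread_b };
    const size_t count = sizeof queue / sizeof queue[0];
    volatile size_t next = 0; /* volatile: modified between setjmp and longjmp */

    exited = 0;
    setjmp(main_ctx);         /* every exiting thread lands here again */
    if (next < count) {       /* Step 5: pick the next thread, if any */
        thread_fn fn = queue[next];
        next = next + 1;
        fn();                 /* leaves via thread_exit(), not by returning */
    }
    return exited;            /* queue empty: the program is done */
}
```

Calling run_threads() runs both threads and returns 2. In the real thing, the call fn() would be a longjmp into a jmpbuf whose stack pointer lives in one of the allocated stacks.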

It may be bending the rules a little, but GNU pth does this. It's possible, but you probably shouldn't try it yourself except as an academic proof-of-concept exercise; use the pth implementation if you want to do it seriously and in a remotely portable fashion -- you'll understand why when you read the pth thread creation code.
(Essentially it uses a signal handler to trick the OS into creating a fresh stack, then longjmp's out of there and keeps the stack around. It works, evidently, but it's sketchy as hell.)
In production code, if your OS supports makecontext/swapcontext, use those instead. If it supports CreateFiber/SwitchToFiber, use those instead. And be aware of the disappointing truth that one of the most compelling uses of coroutines -- that is, inverting control by yielding out of event handlers called by foreign code -- is unsafe because the calling module has to be reentrant, and you generally can't prove that. This is why fibers still aren't supported in .NET...
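For comparison, a minimal sketch of the makecontext/swapcontext route, assuming a system that ships <ucontext.h> (glibc on Linux does). The task body, the 64 KB stack size and the names run_task, steps and done are arbitrary choices for the example.

```c
#include <stdlib.h>
#include <ucontext.h>

static ucontext_t main_ctx, task_ctx;
static int steps; /* work done by the task so far */
static int done;  /* set by the task just before it returns */

/* A task that does three steps, yielding to the main context after each. */
static void task(void)
{
    for (int i = 0; i < 3; i++) {
        steps++;
        swapcontext(&task_ctx, &main_ctx); /* yield */
    }
    done = 1;
    /* Returning resumes uc_link, which is main_ctx. */
}

/* Runs the task to completion; returns how many times it was resumed,
   or -1 on failure. */
int run_task(void)
{
    const size_t stack_size = 64 * 1024;
    void *stack = malloc(stack_size);
    if (stack == NULL)
        return -1;

    steps = 0;
    done = 0;
    if (getcontext(&task_ctx) == -1) {
        free(stack);
        return -1;
    }
    task_ctx.uc_stack.ss_sp = stack;
    task_ctx.uc_stack.ss_size = stack_size;
    task_ctx.uc_link = &main_ctx; /* where to go when task() returns */
    makecontext(&task_ctx, task, 0);

    int resumes = 0;
    while (!done) {
        resumes++;
        swapcontext(&main_ctx, &task_ctx); /* switch into the task */
    }
    free(stack);
    return resumes;
}
```

Here run_task() resumes the task four times: three resumes end in a yield and the fourth lets it finish. No jmpbuf has to be poked at, because makecontext is given the stack explicitly.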

This is a form of what is known as userspace context switching.
It's possible but error-prone, especially if you use the default implementation of setjmp and longjmp. One problem with these functions is that in many operating systems they'll only save a subset of 64-bit registers, rather than the entire context. This is often not enough, e.g. when dealing with system libraries (my experience here is with a custom implementation for amd64/Windows, which worked fairly reliably, all things considered).
That said, if you're not trying to work with complex external codebases or event handlers, and you know what you're doing, and (especially) if you write your own version in assembler that saves more of the current context (if you're using 32-bit windows or linux this might not be necessary, if you use some versions of BSD I imagine it almost definitely is), and you debug it paying careful attention to the disassembly output, then you may be able to achieve what you want.

I did something like this for studies.
https://github.com/Kraego/STM32L476_MiniOS/blob/main/Usercode/Concurrency/scheduler.c
The context/thread switching is done by setjmp/longjmp. The difficult part was getting the allocated stack right (see allocateStack()); this depends on your platform.
This is just a demonstration of how this could work; I would never use this in production.

As was already mentioned by Sean Ogden,
longjmp() is not good for multitasking, as
it can only move the stack upward and can't
jump between different stacks. No go with that.
As mentioned by user414736, you can use getcontext/makecontext/swapcontext
functions, but the problem with those is that
they are not fully in user-space. They actually
call the sigprocmask() syscall because they switch
the signal mask as part of the context switching.
This makes swapcontext() much slower than longjmp(),
and you likely don't want the slow co-routines.
To my knowledge there is no POSIX-standard solution to
this problem, so I compiled my own from different
available sources. You can find the context-manipulating
functions extracted from libtask here:
https://github.com/dosemu2/dosemu2/tree/devel/src/base/lib/mcontext
The functions are:
getmcontext(), setmcontext(), makemcontext() and swapmcontext().
Their semantics are similar to those of the standard functions with similar names,
but they also mimic the setjmp() semantics in that getmcontext()
returns 1 (instead of 0) when jumped to by setmcontext().
On top of that you can use a port of libpcl, the coroutine library:
https://github.com/dosemu2/dosemu2/tree/devel/src/base/lib/libpcl
With this, it is possible to implement fast cooperative user-space
threading. It works on Linux, on the i386 and x86_64 architectures.

Related

Could you implement async-await by memcopying stack frames rather than creating state machines?

I am trying to understand all the low-level stuff Compilers / Interpreters / the Kernel do for you (because I'm yet another person who thinks they could design a language that's better than most others)
One of the many things that sparked my curiosity is Async-Await.
I've checked the under-the-hood implementation for a couple languages, including C# (the compiler generates the state machine from sugar code) and Rust (where the state machine has to be implemented manually from the Future trait), and they all implement Async-Await using state machines.
I've not found anything useful by googling ("async copy stack frame" and variations) or in the "Similar questions" section.
To me, this method seems rather complicated and overhead-heavy;
Could you not implement Async-Await by simply memcopying the stack frames of async calls to/from heap?
I'm aware that it is architecturally impossible for some languages (I think the CLR can't do it, so C# can't either).
Am I missing something that makes this logically impossible? I would expect less complicated code and a performance boost from doing it that way, am I mistaken? I suppose when you have a deep stack hierarchy after an async call (e.g. a recursive async function) the amount of data you would have to memcopy is rather large, but there are probably ways to work around that.
If this is possible, then why isn't it done anywhere?
Yes, an alternative to converting code into state machines is copying stacks around. This is the way that the Go language does it now, and the way that Java will do it when Project Loom is released.
It's not an easy thing to do for real-world languages.
It doesn't work for C and C++, for example, because those languages let you make pointers to things on the stack. Those pointers can be used by other threads, so you can't move the stack away, and even if you could, you would have to copy it back into exactly the same place.
For the same reason, it doesn't work when your program calls out to the OS or native code and gets called back in the same thread, because there's a portion of the stack you don't control. In Java, Project Loom's 'virtual threads' will not release the thread as long as there's native code on the stack.
Even in situations where you can move the stack, it requires dedicated support in the runtime environment. The stack can't just be copied into a byte array. It has to be copied off in a representation that allows the garbage collector to recognize all the pointers in it. If C# were to adopt this technique, for example, it would require significant extensions to the common language runtime, whereas implementing state machines can be accomplished entirely within the C# compiler.
I would like to begin by saying that this answer is only meant to serve as a starting point for your exploration. It includes various pointers and builds on the work of various other authors.
I've checked the under-the-hood implementation for a couple languages, including C# (the compiler generates the state machine from sugar code) and Rust (where the state machine has to be implemented manually from the Future trait), and they all implement Async-Await using state machines
You understood correctly that the Async/Await implementations for C# and Rust use state machines. Let us now look at why those implementations were chosen.
To put the general structure of stack frames in very simple terms, whatever we put inside a stack frame are temporary allocations which are not going to outlive the method which resulted in the addition of that stack frame (including, but not limited to local variables). It also contains the information of the continuation, ie. the address of the code that needs to be executed next (in other words, the control has to return to), within the context of the recently called method. If this is a case of synchronous execution, the methods are executed one after the other. In other words, the caller method is suspended until the called method finishes execution. This, from a stack perspective fits in intuitively. If we are done with the execution of a called method, the control is returned to the caller and the stack frame can be popped off. It is also cheap and efficient from a perspective of the hardware that is running this code as well (hardware is optimised for programming with stacks).
In the case of asynchronous code, the continuation of a method might have to trigger several other methods that might get called from within the continuation of callers. Take a look at this answer, where Eric Lippert outlines the entirety of how the stack works for an asynchronous flow. The problem with asynchronous flow is that the method calls do not exactly form a stack, and trying to handle them like pure stacks may get extremely complicated. As Eric says in the answer, that is why C# uses a graph of heap-allocated tasks and delegates that represents a workflow.
However, if you consider languages like Go, asynchrony is handled in a different way altogether. Go has goroutines, and there is no need for await statements. Each goroutine runs as its own lightweight thread (each has its own stack, which defaults to 8KB in size), and synchronization between them is achieved by communicating through channels. These lightweight threads can wait asynchronously for a read operation on a channel and suspend themselves. The earlier implementation in Go used the split stacks technique. That implementation had its own problems, as listed out here, and was replaced by contiguous stacks. The article also talks about the newer implementation.
One important thing to note here is that it is not just the complexity involved in handling the continuation between the tasks that contribute to the approach chosen to implement Async/Await, there are other factors like Garbage Collection that play a role. GC process should be as performant as possible. If we move stacks around, GC becomes inefficient because accessing an object then would require thread synchronization.
Could you not implement Async-Await by simply memcopying the stack frames of async calls to/from heap?
In short, you can. As this answer states here, Chicken Scheme uses something similar to what you are exploring. It begins by allocating everything on the stack and moves the stack values to the heap when the stack becomes too large for the GC activities (Chicken Scheme uses generational GC). However, there are certain caveats with this kind of implementation. Take a look at this FAQ of Chicken Scheme. There is also a lot of academic research in this area (linked in the answer referred to at the beginning of this paragraph, which I shall summarise under further reading) that you may want to look at.
Further Reading
Continuation Passing Style
call-with-current-continuation
The classic SICP book
This answer (contains a few links to academic research in this area)
TLDR
The decision of which approach to take depends on factors that affect the overall usability and performance of the language. State machines, as used in C# and Rust, are not the only way to implement Async/Await functionality. A few languages, like Go, implement a contiguous stack approach coordinated over channels for asynchronous operations. Chicken Scheme allocates everything on the stack and moves the recent stack values to the heap when the stack becomes too heavy for its GC algorithm's performance. Moving stacks around has its own set of implications that affect garbage collection negatively. Going through the research done in this space will help you understand the advancements and the rationale behind each of the approaches. At the same time, you should also give some thought to how you plan to design and implement the other parts of your language for it to be anywhere close to usable in terms of performance and overall usability.
PS: Given the length of this answer, will be happy to correct any inconsistencies that may have crept in.
I have been looking into various strategies for doing this myself, because I naturally think I can design a language better than anybody else - same as you. I just want to emphasize that when I say better, I actually mean better as in tastes better for my liking, and not objectively better.
I have come to a few different approaches, and to summarize: It really depends on many other design choices you have made in the language.
It is all about compromises; each approach has advantages and disadvantages.
It feels like the compiler design community are still very focused on garbage collection and minimizing memory waste, and perhaps there is room for some innovation for more lazy and less purist language designers given the vast resources available to modern computers?
How about not having a call stack at all?
It is possible to implement a language without using a call stack.
Pass continuations. The function currently running is responsible for keeping and resuming the state of the caller. Async/await and generators come naturally.
Preallocated static memory addresses for all local variables in all declared functions in the entire program. This approach causes other problems, of course.
If this is your design, then async functions seem trivial.
Tree shaped stack
With a tree shaped stack, you can keep all stack frames until the function is completely done. It does not matter if you allow progress on any ancestor stack frame, as long as you let the async frame live on until it is no longer needed.
Linear stack
How about serializing the function state? It seems like a variant of continuations.
Independent stack frames on the heap
Simply treat invocations like you treat other pointers to any value on the heap.
All of the above are trivialized approaches, but one thing they have in common related to your question:
Just find a way to store any locals needed to resume the function. And don't forget to store the program counter in the stack frame as well.
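To make that last point concrete, here is a minimal hand-written state machine in C, roughly the shape a compiler might generate for a function that yields 0, 1, ..., limit-1. The names counter_frame, counter_resume and counter_sum are made up for the example; the frame would normally live on the heap.

```c
#include <stdbool.h>

/* The saved "stack frame": everything needed to resume the function. */
struct counter_frame {
    int pc;    /* saved program counter: which resume point to continue at */
    int i;     /* a local that must survive across suspensions */
    int limit; /* the argument */
};

void counter_init(struct counter_frame *f, int limit)
{
    f->pc = 0;
    f->i = 0;
    f->limit = limit;
}

/* Resume the function. Returns true and stores the next value in *out,
   or returns false once the function has finished. */
bool counter_resume(struct counter_frame *f, int *out)
{
    switch (f->pc) {
    case 0: /* first entry: run the prologue */
        f->i = 0;
        f->pc = 1;
        /* fall through */
    case 1: /* loop body: yield one value per resume */
        if (f->i < f->limit) {
            *out = f->i++;
            return true; /* suspend; the locals stay in the frame */
        }
        f->pc = 2;
        /* fall through */
    default: /* finished */
        return false;
    }
}

/* Example driver: sums everything the coroutine yields. */
int counter_sum(int limit)
{
    struct counter_frame f;
    int value, total = 0;

    counter_init(&f, limit);
    while (counter_resume(&f, &value))
        total += value;
    return total;
}
```

Nothing on the real call stack has to be copied here, because the only state that crosses a suspension point is stored in the frame struct. That is the trade the state-machine approach makes against copying stacks.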

Why should nesting of QEventLoops be avoided?

In his Qt event loop, networking and I/O API talk, Thiago Macieira mentions that nesting of QEventLoop's should be avoided:
QEventLoop is for nesting event Loops... Avoid it if you can because it creates a number of problems: things might reenter, new activations of sockets or timers that you were not expecting.
Can anybody expand on what he is referring to? I maintain a lot of code that uses modal dialogs which internally nest a new event loop when exec() is called so I'm very interested in knowing what kind of problems this may lead to.
A nested event loop costs you 1-2 KB of stack. It takes up 5% of the L1 data cache on typical 32 KB L1 cache CPUs, give-or-take.
It has the capacity to reenter any code already on the call stack. There are no guarantees that any of that code was designed to be reentrant. I'm talking about your code, not Qt's code. It can reenter code that has started this event loop, and unless you explicitly control this recursion, there are no guarantees that you won't eventually run out of stack space.
In current Qt, there are two places where, due to long-standing API bugs or platform inadequacies, you have to use nested exec: QDrag and platform file dialogs (on some platforms). You simply don't need to use it anywhere else. You do not need a nested event loop for non-platform modal dialogs.
Reentering the event loop is usually caused by writing pseudo-synchronous code where one laments the supposed lack of yield() (co_yield and co_await have landed in C++ now!), hides one's head in the sand and uses exec() instead. Such code typically ends up being barely palatable spaghetti and is unnecessary.
For modern C++, using the C++20 coroutines is worthwhile; there are some Qt-based experiments around, easy to build on.
There are Qt-native implementations of stackful coroutines: Skycoder42/QtCoroutines - a recent project, and the older ckamm/qt-coroutine. I'm not sure how fresh the latter code is. It looks like it all worked at some point.
Writing asynchronous code cleanly without coroutines is usually accomplished through state machines, see this answer for an example, and QP framework for an implementation different from QStateMachine.
Personal anecdote: I couldn't wait for C++ coroutines to become production-ready, and I now write asynchronous communication code in golang, and statically link that into a Qt application. Works great, the garbage collector is unnoticeable, and the code is way easier to read and write than C++ with coroutines. I had a lot of code written using C++ coroutines TS, but moved it all to golang and I don't regret it.
A nested event loop will lead to ordering inversion (at least on Qt 4).
Let's say you have the following sequence of things happening:
enqueued in outer loop: 1,2,3
processing 1 => spawn inner loop
enqueue 4 in inner loop
processing 4
exit inner loop
processing 2
So you see the processing order was: 1,4,2,3.
I speak from experience and this usually resulted in a crash in my code.
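The sequence above can be reproduced with a toy model in plain C. This is not Qt code; it assumes, as the sequence describes, that the inner loop only sees the event posted to it, and all names (struct loop, post, run, simulate) are made up.

```c
#include <stddef.h>

#define MAX_EVENTS 8

/* A toy event loop: a fixed-size FIFO of integer events. */
struct loop {
    int    pending[MAX_EVENTS];
    size_t head, tail;
};

static int    order[MAX_EVENTS]; /* events in the order they were processed */
static size_t order_len;

static void post(struct loop *l, int ev)
{
    l->pending[l->tail++] = ev;
}

static void run(struct loop *l);

/* Handling event 1 spawns a nested loop that receives event 4. */
static void handle(int ev)
{
    order[order_len++] = ev;
    if (ev == 1) {
        struct loop inner = { {0}, 0, 0 };
        post(&inner, 4);
        run(&inner); /* nested loop runs while 2 and 3 are still queued */
    }
}

static void run(struct loop *l)
{
    while (l->head < l->tail)
        handle(l->pending[l->head++]);
}

/* Posts 1, 2, 3 to the outer loop, runs it, and returns the number of
   events processed; the order is left in order[]. */
size_t simulate(void)
{
    struct loop outer = { {0}, 0, 0 };

    order_len = 0;
    post(&outer, 1);
    post(&outer, 2);
    post(&outer, 3);
    run(&outer);
    return order_len;
}
```

After simulate(), order[] holds 1, 4, 2, 3: event 4 was posted last but handled second, while the handler for event 1 was still on the call stack.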

Difference between write() and printf()

I have recently been studying operating systems, and I would like to know:
What’s the difference between a system call (like write()) and a standard library function (like printf())?
A system call is a call to a function that is not part of the application but is inside the kernel. The kernel is a software layer that provides you some basic functionalities to abstract the hardware to you. Roughly, the kernel is something that turns your hardware into software.
You always ultimately use write() to write anything to a peripheral, whatever kind of device you are writing to. write() is designed to write a sequence of bytes and nothing more. But as write() is considered too basic (you may want to write an integer in base ten, or a floating-point number in scientific notation, etc.), different programming environments provide libraries to make this easier.
For example, the C programming language gives you printf(), which lets you write data in many different formats. So you can understand printf() as a function that converts your data into a formatted sequence of bytes and calls write() to write those bytes to the output. C++ gives you cout, Java gives you System.out.println, etc. Each of these functions ends in a call to write() (at least on POSIX systems).
One important thing to know is that such a system call is costly! It is not a simple function call, because you need to call something that is outside of your own code, and the system must ensure that you are not trying to do nasty things, etc. So it is very common for higher-level print-like functions to have some buffering built in, such that write is not always called: your data is kept in some hidden structure and written only when it is really needed (the buffer is full, or you really want to see the result of your print).
This is exactly what happens when you manage your money. If many people give you 5 bucks each, you won't go and deposit each one at the bank! You keep them in your wallet (this is the print) up to the point where it is full or you don't want to keep them anymore. Then you go to the bank and make a big deposit (this is the write). And you know that putting 5 bucks in your wallet is much, much faster than going to the bank and making the deposit. The bank is the kernel/OS.
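The wallet effect can be observed directly. The following is a minimal sketch for a POSIX system (it uses fileno and fstat); the function names and the choice of a temporary file are just for illustration. It asks the kernel how big the file is before and after the stdio buffer is flushed.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/stat.h>

/* File size as the kernel sees it, or -1 on error. */
static long kernel_size(FILE *fp)
{
    struct stat st;

    if (fstat(fileno(fp), &st) != 0)
        return -1;
    return (long)st.st_size;
}

/* Writes 6 bytes through stdio and reports the kernel-visible size
   before and after fflush(). Returns 0 on success, -1 on failure. */
int buffering_demo(long *before_flush, long *after_flush)
{
    FILE *fp = tmpfile();

    if (fp == NULL)
        return -1;
    setvbuf(fp, NULL, _IOFBF, 4096);  /* fully buffered, 4 KB wallet */

    fprintf(fp, "hello\n");           /* goes into the user-space buffer */
    *before_flush = kernel_size(fp);  /* the kernel has seen nothing yet */

    fflush(fp);                       /* stdio now hands the bytes to write() */
    *after_flush = kernel_size(fp);

    fclose(fp);
    return 0;
}
```

On a typical system before_flush comes back as 0 and after_flush as 6: the bytes only reach the kernel once the buffer is flushed.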
System calls are implemented by the operating system, and run in kernel mode. Library functions are implemented in user mode, just like application code. Library functions might invoke system calls (e.g. printf eventually calls write), but that depends on what the library function is for (math functions usually don't need to use the kernel).
System calls are used to interact with the OS. E.g. write() could be used to write something into the system or into a program.
Standard library functions, on the other hand, are program-specific. E.g. printf() will print something out, but only to the GUI/command line, and won't affect the system.
EDIT: Barmar has good answer
I am writing a small program. At the moment it just reads each line from stdin and prints it to stdout. I can add a call to write in the loop, and it would add a few characters at the end of each line. But when I use printf instead, then all the extra characters are clustered and appear all at once, instead of appearing on each line.
It seems that using printf causes stdout to be buffered. Adding fflush(stdout); after calling printf fixes the discrepancy in output.
I'd like to mention another point: the stdio buffers are maintained in the process's user-space memory, while the write system call transfers data directly to a kernel buffer. This means that if you fork a process after calling write and printf, you may see the output three times, depending on whether stdout is line-buffered or block-buffered: once from write and twice from printf, because the stdio buffer is duplicated in the child by fork.
printf() is one of the APIs or interfaces exposed to user space to call functions from C library.
printf() actually uses write() system call. The write() system call is actually responsible for sending data to the output.

How does functional programming avoid state when it seems unavoidable?

Let's say we define a function c sum(a, b), functional-programming style, that returns the sum of its arguments. So far so good; all the nice things of FP without any problems.
Now let's say we run this in an environment with dynamic typing and a singleton, stateful error stream. Then let's say we pass a value of a and/or b that sum isn't designed to handle (i.e. not numbers), and it needs to indicate an error somehow.
But how? This function is supposed to be pure and side-effect-less. How does it insert an error into the global error stream without violating that?
No programming language that I know of has anything like a "singleton stateful error stream" built in, so you'd have to make one. And you simply wouldn't make such a thing if you were trying to write your program in a pure functional style.
You could, however, have a sum function that returns either the sum or an indication of an error. The type used to do this is in fact often known by the name Either. Then you could easily make a function that invokes a whole bunch of computations that could possibly return an error, and returns a list of all the errors that were encountered in the other computations. That's pretty close to what you were talking about; it's just explicitly returned rather than being global.
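As a minimal sketch of that idea in C (the types value and either, and the function collect_errors, are invented for this example; a functional language would use a proper sum type), sum returns either a number or an error description, and a separate pure function gathers the errors:

```c
#include <stdbool.h>
#include <stddef.h>

/* A dynamically typed value: either a number or a string. */
enum value_tag { VAL_NUMBER, VAL_STRING };

struct value {
    enum value_tag tag;
    double         number; /* meaningful when tag == VAL_NUMBER */
    const char    *string; /* meaningful when tag == VAL_STRING */
};

/* "Either": ok selects which of value / error is meaningful. */
struct either {
    bool        ok;
    double      value;
    const char *error;
};

/* Pure: the result depends only on the arguments, and nothing global is
   touched. */
struct either sum(struct value a, struct value b)
{
    struct either r = { false, 0.0, NULL };

    if (a.tag != VAL_NUMBER || b.tag != VAL_NUMBER) {
        r.error = "sum: both arguments must be numbers";
        return r;
    }
    r.ok = true;
    r.value = a.number + b.number;
    return r;
}

/* Copies the error messages found in results[0..n-1] into errors[] and
   returns how many there were. errors[] must have room for n entries. */
size_t collect_errors(const struct either *results, size_t n,
                      const char **errors)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++)
        if (!results[i].ok)
            errors[count++] = results[i].error;
    return count;
}
```

The caller decides what to do with the collected errors; sum itself never writes to any stream.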
Remember, the question when you're writing a functional program is "how do I make a program that has the behavior I want?" not, "how would I duplicate one particular approach taken in another programming style?". A "global stateful error stream" is a means not an end. You can't have a global stateful error stream in pure function style, no. But ask yourself what you're using the global stateful error stream to achieve; whatever it is, you can achieve that in functional programming, just not with the same mechanism.
Asking whether pure functional programming can implement a particular technique that depends on side effects is like asking how you use techniques from assembly in object-oriented programming. OO provides different tools for you to use to solve problems; limiting yourself to using those tools to emulate a different toolset is not going to be an effective way to work with them.
In response to comments: If what you want to achieve with your error stream is logging error messages to a terminal, then yes, at some level the code is going to have to do IO to do that.1
Printing to the terminal is just like any other IO; there's nothing particularly special about it that makes it worthy of singling out as a case where state seems especially unavoidable. So if this turns your question into "How do pure functional programs handle IO?", then there are no doubt many duplicate questions on SO, not to mention many, many blog posts and tutorials speaking precisely to that issue. It's not a sudden surprise to implementors and users of pure programming languages; the question has been around for decades, and there has been some quite sophisticated thought put into the answers.
There are different approaches taken in different languages (IO monad in Haskell, unique modes in Mercury, lazy streams of requests and responses in historical versions of Haskell, and more). The basic idea is to come up with a model which can be manipulated by pure code, and hook up manipulations of the model to actual impure operations within the language implementation. This allows you to keep the benefits of purity (the proofs that apply to pure code but not to general impure code will still apply to code using the pure IO model).
The pure model has to be carefully designed so that you can't actually do anything with it that doesn't make sense in terms of actual IO. For example, Mercury does IO by having you write programs as if you're passing around the current state of the universe as an extra parameter. This pure model accurately represents the behaviour of operations that depend on and affect the universe outside the program, but only when there is exactly one state of the universe in the system at any one time, which is threaded through the entire program from start to finish. So some restrictions are put in place:
The type io is made abstract so that there's no way to construct a value of that type; the only way you can get one is to be passed one from your caller. An io value is passed into the main predicate by the language implementation to kick the whole thing off.
The mode of the io value passed in to main is declared such that it is unique. This means you can't do things that might cause it to be duplicated, such as putting it in a container or passing the same io value to multiple different invocations. The unique mode ensures that you can only pass the io value to a predicate that also uses the unique mode, and as soon as you pass it once the value is "dead" and can't be passed anywhere else.
1 Note that even in imperative programs, you gain a lot of flexibility if you have your error logging system return a stream of error messages and then only actually make the decision to print them close to the outermost layer of the program. If your log calls are directly writing the output immediately, here's just a few things I can think of off the top of my head that become much harder to do with such a system:
Speculatively execute a computation and see whether it failed by checking whether it emitted any errors
Combine multiple high level systems into a single system, adding tags to the logs to distinguish each system
Emit debug and info log messages only if there is also an error message (so the output is clean when there are no errors to debug, and rich in detail when there are)

Is a preemptive multitasking OS possible on the interruptless DCPU-16?

I am looking into various OS designs in the hopes of writing a simple multitasking OS for the DCPU-16. However, everything I read about implementation of preemptive multitasking is centered around interrupts. It sounds like in the era of 16-bit hardware and software, cooperative multitasking was more common, but that requires every program to be written with multitasking in mind.
Is there any way to implement preemptive multitasking on an interruptless architecture? All I can think of is an interpreter which would dynamically switch tasks, but that would have a huge performance hit (possibly on the order of 10-20x+ if it had to parse every operation and didn't let anything run natively, I'm imagining).
Preemptive multitasking is normally implemented by having interrupt routines post status changes/interesting events to a scheduler, which decides which tasks to suspend, and which new tasks to start/continue based on priority. However, other interesting events can occur when a running task makes a call to an OS routine, which may have the same effect.
But all that matters is that some event is noted somewhere, and the scheduler decides who to run. So you can make all such event signalling/scheduling occur only on OS calls.
You can add egregious calls to the scheduler at "convenient" points in various task application code to make your system switch more often. Whether it just switches, or uses some background information such as elapsed time since the last call is a scheduler detail.
Your system won't be as responsive as one driven by interrupts, but you've already given that up by choosing the CPU you did.
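To make the "scheduling only on OS calls" idea concrete, here is a minimal sketch in C of just the decision logic. The names `task_set`, `os_signal` and `os_wait` are hypothetical, not part of any real DCPU-16 OS, and the actual register save/restore that a context switch needs is left out.

```c
#include <assert.h>

enum { MAX_TASKS = 4 };

typedef struct {
    int priority;  /* larger value = more urgent */
    int ready;     /* nonzero if the task is able to run */
} Task;

static Task tasks[MAX_TASKS];
static int current = -1;   /* index of the running task, -1 if none */

/* Hypothetical setup helper: describe one task. */
void task_set(int id, int priority, int ready) {
    assert(id >= 0 && id < MAX_TASKS);
    tasks[id].priority = priority;
    tasks[id].ready = ready;
}

/* Pick the highest-priority ready task (lowest index wins ties), or -1. */
static int schedule(void) {
    int best = -1;
    for (int i = 0; i < MAX_TASKS; i++)
        if (tasks[i].ready &&
            (best < 0 || tasks[i].priority > tasks[best].priority))
            best = i;
    return best;
}

/* Hypothetical OS call: wake task `id`, then let the scheduler decide.
   A real kernel would save and restore registers at this point. */
int os_signal(int id) {
    assert(id >= 0 && id < MAX_TASKS);
    tasks[id].ready = 1;
    current = schedule();
    return current;
}

/* Hypothetical OS call: the calling task blocks until it is signalled. */
int os_wait(void) {
    if (current >= 0)
        tasks[current].ready = 0;
    current = schedule();
    return current;
}
```

Every switch here happens inside an OS call, so a task that never calls the OS is never suspended; that is the responsiveness you give up without interrupts.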
Actually, yes. The most effective method is to simply patch run-times in the loader. Kernel/daemon stuff can have custom patches for better responsiveness. Even better, if you have access to all the source, you can patch in the compiler.
The patch can consist of a distributed scheduler of sorts. Each program can be patched to have a very low-latency timer; on load, it will set the timer, and on each return from the scheduler, it will reset it. A simplistic method would allow code to simply do a check such as
if (timer - start_timer >= quantum) yield_to_scheduler();  /* quantum: the chosen time slice */
which doesn't yield too big a performance hit. The main trouble is finding good points to pop them in. In between every function call is a start, and detecting loops and inserting them is primitive but effective if you really need to preempt responsively.
It's not perfect, but it'll work.
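As a rough illustration of what such a patched-in check could look like, here is a sketch in C. Everything in it is an assumption made for the example: `timer` stands in for a free-running 16-bit tick counter, `QUANTUM` is an arbitrary slice length, and `yield_to_scheduler` only counts calls instead of switching tasks.

```c
#include <assert.h>
#include <stdint.h>

#define QUANTUM 100u            /* assumed slice length in ticks */

static uint16_t timer;          /* stand-in for a hardware tick counter */
static uint16_t start_timer;    /* tick value when this task last resumed */
static unsigned long yields;    /* how often the check fired */

/* Stand-in for the real switch: count it, then reset the slice start
   as the answer describes for "each return from the scheduler". */
static void yield_to_scheduler(void) {
    yields++;
    start_timer = timer;
}

/* The check a loader or compiler would insert between calls and in loops.
   The cast keeps the subtraction correct when the counter wraps. */
#define PREEMPT_CHECK() \
    do { \
        if ((uint16_t)(timer - start_timer) >= QUANTUM) \
            yield_to_scheduler(); \
    } while (0)

/* A "patched" loop: one simulated tick of work per iteration. */
unsigned long run_patched_loop(unsigned long iterations) {
    assert(QUANTUM < 65536u);   /* the quantum must fit in the counter */
    timer = 0;
    start_timer = 0;
    yields = 0;
    for (unsigned long i = 0; i < iterations; i++) {
        timer++;                /* simulate the clock advancing */
        PREEMPT_CHECK();
    }
    return yields;
}
```

The cost per check is one subtraction, one comparison and a branch that is usually not taken, which is why the placement of the checks matters more than their individual cost.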
The main issue is making sure that reading the timer is low latency; that way the check is just a comparison and a branch. You also need some way of handling exceptions, that is, errors in the code that cause, say, infinite loops. You can technically use a fairly simple hardware watchdog timer and assert a reset on the CPU without clearing any of the RAM; the RESET vector would point at an in-RAM routine which would inspect and unwind the stack back to the program call (thus crashing the program but preserving everything else). It's sort of a brute-force, if-all-else-fails way to crash the program. Or you could POTENTIALLY change it to multitask this way, using RESET as an interrupt, but that is much more difficult.
So...yes. It's possible but complicated, and it relies on techniques from JIT compilers and dynamic translators (emulators use them).
This is a bit of a muddled explanation, I know, but I am very tired. If it's not clear enough I can come back and clear it up tomorrow.
By the way, asserting reset on a CPU mid-program sounds crazy, but it is a time-honored and proven technique. Early versions of Windows even did it, on the 286 I think, because that chip had no way to switch back from protected mode to real mode short of a reset. Other processors and OSes have done it too.
EDIT: So I did some research on what the DCPU is, haha. It's not a real CPU. I have no idea if you can assert reset in Notch's emulator, I would ask him. Handy technique, that is.
I think your assessment is correct. Preemptive multitasking occurs if the scheduler can interrupt (in the non-inflected, dictionary sense) a running task and switch to another autonomously. So there has to be some sort of actor that prompts the scheduler to action. If there are no interrupting devices (in the inflected, technical sense) then there's little you can do in general.
However, rather than switching to a full interpreter, one idea that occurs is just dynamically reprogramming supplied program code. So before entry into a process, the scheduler knows full process state, including what program counter value it's going to enter at. It can then scan forward from there, substituting, say, either the twentieth instruction code or the next jump instruction code that isn't immediately at the program counter with a jump back into the scheduler. When the process returns, the scheduler puts the original instruction back in. If it's a jump (conditional or otherwise) then it also effects the jump appropriately.
Of course, this scheme works only if the program code doesn't dynamically modify itself. And in that case you can preprocess it so that you know in advance where jumps are without a linear search. You could technically allow well-written self-modifying code if it were willing to nominate all addresses that may be modified, so that your scheduler's dynamic modifications can be sure to avoid them.
You'd end up sort of running an interpreter, but only for jumps.
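Here is a small C sketch of that patching scheme on a made-up machine. The instruction set (`OP_INC`, `OP_JMP`, `OP_HALT`, `OP_TRAP`), the `Task` layout and the slice length are all invented for illustration and have nothing to do with the real DCPU-16 encoding; `cpu_run` stands in for native execution, and programs are assumed to end in a jump or a halt.

```c
#include <assert.h>

typedef enum { OP_INC, OP_JMP, OP_HALT, OP_TRAP } Op;
typedef struct { Op op; int arg; } Insn;

typedef struct {
    Insn *code;    /* program memory, patched in place */
    int   len;     /* number of instructions */
    int   pc;      /* saved program counter */
    long  acc;     /* the toy machine's only register */
    int   halted;
} Task;

enum { SLICE = 20 };  /* patch at most this many instructions ahead */

/* Stand-in for native execution: runs unsupervised until TRAP or HALT. */
static void cpu_run(Task *t) {
    for (;;) {
        Insn in = t->code[t->pc];
        switch (in.op) {
        case OP_INC:  t->acc++; t->pc++; break;
        case OP_JMP:  t->pc = in.arg;    break;
        case OP_HALT: t->halted = 1;     return;
        case OP_TRAP:                    return;  /* back in the scheduler */
        }
    }
}

/* Run one time slice of task t. */
void run_slice(Task *t) {
    if (t->halted) return;
    /* A jump sitting exactly at pc cannot be patched without stalling,
       so the scheduler carries it out itself. */
    if (t->code[t->pc].op == OP_JMP)  { t->pc = t->code[t->pc].arg; return; }
    if (t->code[t->pc].op == OP_HALT) { t->halted = 1; return; }

    /* Scan forward to the next jump/halt or the SLICE-th instruction. */
    int at = t->pc + 1;
    while (at < t->pc + SLICE && at + 1 < t->len && t->code[at].op == OP_INC)
        at++;
    assert(at < t->len);  /* assumes the program ends in JMP or HALT */

    Insn saved = t->code[at];
    t->code[at].op = OP_TRAP;   /* substitute a jump back to the scheduler */
    cpu_run(t);                 /* returns with t->pc == at */
    t->code[at] = saved;        /* put the original instruction back */
    if (saved.op == OP_JMP)
        t->pc = saved.arg;      /* effect the jump on the task's behalf */
}
```

Calling `run_slice` repeatedly on a task whose code is an endless loop always returns, which is the property you want: the loop body runs unmodified, and only the jump goes through the scheduler.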
Another way is to keep to small tasks driven by an event queue (like current GUI apps).
This is also cooperative, but it has the effect of not needing OS calls: you just return from the task, and the queue then goes on to the next task.
If you then need to continue a task, you pass the next "function" and a pointer to the data it needs to the task queue.
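A minimal sketch of such a queue in C might look like the following. The names `post`, `run_until_idle`, `Counter` and `count_step` are made up for the example, and the fixed-size ring buffer is just one simple way to hold the pending tasks.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*TaskFn)(void *data);
typedef struct { TaskFn fn; void *data; } Event;

enum { QUEUE_CAP = 16 };
static Event queue[QUEUE_CAP];  /* ring buffer of pending tasks */
static size_t head, count;

/* Append a (function, data) pair to the queue. */
void post(TaskFn fn, void *data) {
    assert(count < QUEUE_CAP);
    queue[(head + count) % QUEUE_CAP] = (Event){ fn, data };
    count++;
}

/* Dispatch tasks in FIFO order until none are left. */
void run_until_idle(void) {
    while (count > 0) {
        Event e = queue[head];
        head = (head + 1) % QUEUE_CAP;
        count--;
        e.fn(e.data);           /* the task runs to completion and returns */
    }
}

/* Example task: does one small step, then re-posts itself to continue. */
typedef struct { int value; int limit; } Counter;

void count_step(void *data) {
    Counter *c = data;
    c->value++;
    if (c->value < c->limit)
        post(count_step, c);    /* the "next function" plus its data */
}
```

Posting two counters and calling `run_until_idle()` interleaves their steps one at a time, with no stack switching at all; the price is that every task has to be written as short steps that return quickly.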

Resources