How are branch mispredictions handled before a hardware interrupt - intel

A hardware interrupt occurs on a particular vector (not masked). The CPU checks the IF flag and pushes RFLAGS, CS and RIP to the stack. Meanwhile there are still instructions completing in the back end, and the branch prediction for one of these instructions turns out to be wrong. Usually the pipeline would be flushed and the front end would start fetching from the correct address, but in this scenario an interrupt is in progress.
When an interrupt occurs, what happens to instructions in the pipeline?
I have read this, and clearly one solution is to immediately flush everything from the pipeline so that this doesn't occur, and then generate the instructions that push RFLAGS, CS and RIP to the kernel stack located via the TSS. However, the question arises: how does the CPU know the (CS:)RIP associated with the most recent architectural state in order to push it on the stack (given that the front-end RIP would now be ahead)? This is similar to the question of how the taken-branch execution unit on port 0 knows the (CS:)RIP of what should have been fetched when the taken prediction turns out to be wrong. Is the address encoded into the instruction along with the prediction? The same issue arises for a trap or exception: the CPU needs to push the address of the current instruction (fault) or the next instruction (trap) to the kernel stack, but how does it work out the address of this instruction when it is halfway down the pipeline? This leads me to believe that the address must be encoded into the instruction and is worked out using the length information, possibly all at the predecode stage.

The CPU will presumably discard the contents of the ROB, rolling back to the latest retirement state before servicing the interrupt.
An in-flight branch miss doesn't change this. Depending on the CPU (older / simpler), it might have already been in the process of rolling back to retirement state and flushing because of a branch miss, when the interrupt arrived.
As @Hadi says, the CPU could choose at that point to retire the branch (with the interrupt pushing a CS:RIP pointing to the correct branch target), instead of leaving it to be re-executed after returning from the interrupt.
But that only works if the branch instruction was already ready to retire: there were no instructions older than the branch still not executed. Since it's important to discover branch misses as early as possible, I assume branch recovery starts when it discovers a mispredict during execution, not waiting until it reaches retirement. (This is unlike other kinds of faults: e.g. Meltdown and L1TF are based on a faulting load not triggering #PF fault handling until it reaches retirement so the CPU is sure there really is a fault on the true path of execution. You don't want to start an expensive pipeline flush until you're sure it wasn't in the shadow of a mispredict or earlier fault.)
But since branch misses don't take an exception, redirecting the front-end can start early before we're sure that the branch instruction is part of the right path in the first place.
e.g. cmp byte [cache_miss_load], 123 / je mispredicts but won't be discovered for a long time. Then in the shadow of that mispredict, a cmp eax, 1 / je on the "wrong" path runs and a mispredict is discovered for it. With fast recovery, uops past that are flushed and fetch/decode/exec from the "right" path can start before the earlier mispredict is even discovered.
To keep IRQ latency low, CPUs don't tend to give in-flight instructions extra time to retire. Also, any retired stores that still have their data in the store buffer (not yet committed to L1d) have to commit before any stores by the interrupt handler can commit. But interrupts are serializing (I think), and any MMIO or port-IO in a handler will probably involve a memory barrier or strongly-ordered store, so letting more instructions retire can hurt IRQ latency if they involve stores. (Once a store retires, it definitely needs to happen even while its data is still in the store buffer).
The out-of-order back-end always knows how to roll back to a known-good retirement state; the entire contents of the ROB are always considered speculative because any load or store could fault, and so can many other instructions (see footnote 1). Speculation past branches isn't super-special.
Branches are only special in having extra tracking for fast recovery (the Branch Order Buffer in Nehalem and newer) because they're expected to mispredict with non-negligible frequency during normal operation. See What exactly happens when a skylake CPU mispredicts a branch? for some details. Especially David Kanter's quote:
Nehalem enhanced the recovery from branch mispredictions, which has been carried over into Sandy Bridge. Once a branch misprediction is discovered, the core is able to restart decoding as soon as the correct path is known, at the same time that the out-of-order machine is clearing out uops from the wrongly speculated path. Previously, the decoding would not resume until the pipeline was fully flushed.
(This answer is intentionally very Intel-centric because you tagged it intel, not x86. I assume AMD does something similar, and probably most out-of-order uarches for other ISAs are broadly similar. Except that memory-order mis-speculation isn't a thing on CPUs with a weaker memory model where CPUs are allowed to visibly reorder loads.)
Footnote 1: So can div, or any FPU instruction if FP exceptions are unmasked. And a denormal FP result could require a microcode assist to handle, even with FP exceptions masked like they are by default.
On Intel CPUs, a memory-order mis-speculation can also result in a pipeline nuke (load speculatively done early, before earlier loads complete, but the cache lost its copy of the line before the x86 memory model said the load could take its value).

In general, each entry in the ReOrder Buffer (ROB) has a field that is used to store enough information about the instruction address to reconstruct the whole instruction address unambiguously. It may be too costly to store the whole address for each instruction in the ROB. Instructions that have not yet been allocated (i.e., have not yet passed the allocation stage of the pipeline) need to carry this information with them at least until they reach the allocation stage.
If an interrupt and a branch misprediction occur at the same time, the processor may, for example, choose to service the interrupt. In this case, all the instructions that are on the mispredicted path need to be flushed. The processor may also choose to flush other instructions that are on the correct path but have not yet retired. All of these instructions are in the ROB and their instruction addresses are known. For each speculated branch, there is a tag that identifies the speculated path, and all instructions on this path are tagged with it. If there is another, later speculated branch, another tag is used, but it is also ordered with respect to the previous tag. Using these tags, the processor can determine exactly which instructions to flush when any of the speculated branches turns out to be incorrect. This is determined after the corresponding branch instruction completes execution in the branch execution unit. Branches may complete execution out of order. When the correct address of a mispredicted branch is calculated, it's forwarded to the fetch unit and the branch prediction unit (BPU). The fetch unit uses it to fetch instructions from the correct path and the BPU uses it to update its prediction state.
The processor can choose to retire the mispredicted branch instruction itself and flush all other later instructions. All rename registers are reclaimed and those physical registers that are mapped to architectural registers at the point the branch is retired are retained. At this point, the processor executes instructions to save the current state and then begins fetching instructions of the interrupt handler.
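The bookkeeping described above can be illustrated with a toy model. This is only a sketch for intuition, not how any real Intel core is implemented: the names `RobEntry`, `ToyRob`, `resolve_branch` and `take_interrupt` are made up for the example, full addresses are stored for simplicity, and list position stands in for the ordered branch tags.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RobEntry:
    name: str
    rip: int        # address of this instruction
    next_rip: int   # where execution continues afterwards (the prediction, for a branch)
    done: bool = False  # finished executing, waiting to retire


class ToyRob:
    """Hypothetical reorder buffer: entries are kept oldest-first."""

    def __init__(self, start_rip: int) -> None:
        self.entries: List[RobEntry] = []
        self.arch_rip = start_rip    # RIP of the last retirement state
        self.fetch_rip = start_rip   # where the front end fetches next

    def allocate(self, entry: RobEntry) -> None:
        self.entries.append(entry)
        self.fetch_rip = entry.next_rip

    def resolve_branch(self, index: int, actual_target: int) -> List[str]:
        """Branch at `index` executed; on a mispredict flush everything younger."""
        branch = self.entries[index]
        branch.done = True
        if branch.next_rip == actual_target:
            return []  # prediction was right, nothing to flush
        flushed = [e.name for e in self.entries[index + 1:]]
        del self.entries[index + 1:]
        branch.next_rip = actual_target
        self.fetch_rip = actual_target  # redirect the front end
        return flushed

    def take_interrupt(self) -> int:
        """Retire what is already complete at the head, discard the rest,
        and return the RIP that would be pushed for the handler."""
        while self.entries and self.entries[0].done:
            self.arch_rip = self.entries.pop(0).next_rip
        self.entries.clear()
        return self.arch_rip
```

If the mispredicted branch is already able to retire when the interrupt is taken, the pushed RIP is the corrected target. If an older instruction is still outstanding, the pushed RIP is that older instruction's address and the branch is simply re-executed after the handler returns.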

Related

When a process makes a system call to transmit a TCP packet over the network, which of the following steps do NOT occur always?

I am teaching myself OS by going through the lecture notes of the course at IIT Bombay (https://www.cse.iitb.ac.in/~mythili/os/). One of the questions in the Process worksheet asks which of the following doesn't always happen in the situation described at the title. The answer is C.
A. The process moves to kernel mode.
B. The program counter of the CPU shifts to the kernel part of the address space.
C. The process is context-switched out and a separate kernel process starts execution.
D. The OS code that deals with handling TCP/IP packets is invoked
I'm a bit confused though. I thought when an interrupt routine occurs the process is context-switched out so other processes can run and the CPU is not idle during that time. The kernel, then, will take care of the packet sending. Why would C not be correct then?
You are right in saying that "when an interrupt routine occurs the process is context-switched out so other processes can run and the CPU is not idle during that time", but the words "generally or mostly" need to be added to it.
In most cases there is another process waiting for CPU time that can be scheduled. However, this is not the case 100% of the time. The question hinges on the word "always": the other options always occur in the given situation, whereas option C is a choice the OS makes at run time. If the OS determines that switching this process out would be less efficient than performing the system call and resuming the same process, it may not perform the context switch.
There is a cost associated with context switching, so if the other processes are also blocked on I/O it may be optimal for the OS NOT to switch context. There may be other reasons too: for example, if only one process is running, there is no other process to switch to!

Cooperative Multitasking system

I'm trying to get around the concept of cooperative multitasking system and exactly how it works in a single threaded application.
My understanding is that this is a "form of multitasking in which multiple tasks execute by voluntarily ceding control to other tasks at programmer-defined points within each task."
So if you have a list of tasks and one task is executing, how do you decide when to pass execution to another task? And when you give execution back to a previous task, how do you resume from where you were previously?
I find this a bit confusing because I don't understand how this can be achieved without a multithreaded application.
Any advice would be very helpful :)
Thanks
In your specific scenario where a single process (or thread of execution) uses cooperative multitasking, you can use something like Windows' fibers or POSIX setcontext family of functions. I will use the term fiber here.
Basically when one fiber is finished executing a chunk of work and wants to voluntarily allow other fibers to run (hence the "cooperative" term), it either manually switches to the other fiber's context or more typically it performs some kind of yield() or scheduler() call that jumps into the scheduler's context, then the scheduler finds a new fiber to run and switches to that fiber's context.
What do we mean by context here? Basically the stack and registers. There is nothing magic about the stack, it's just a block of memory the stack pointer happens to point to. There is also nothing magic about the program counter, it just points to the next instruction to execute. Switching contexts simply saves the current registers somewhere, changes the stack pointer to a different chunk of memory, updates the program counter to a different stream of instructions, copies that context's saved registers into the CPU, then does a jump. Bam, you're now executing different instructions with a different stack. Often the context switch code is written in assembly that is invoked in a way that doesn't modify the current stack or it backs out the changes, in either case it leaves no traces on the stack or in registers so when code resumes execution it has no idea anything happened. (Again, the theme: we assume that method calls fiddle with registers, push arguments to the stack, move the stack pointer, etc but that is just the C calling convention. Nothing requires you to maintain a stack at all or to have any particular method call leave any traces of itself on the stack).
Since each stack is separate, you don't have some continuous chain of seemingly random method calls eventually overflowing the stack (which might be the result if you naively tried to implement this scheme using standard C methods that continuously called each other). You could implement this manually with a state machine where each fiber kept a state machine of where it was in its work, periodically returning to the calling dispatcher's method, but why bother when actual fiber/co-routine support is widely available?
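As a rough illustration of the yield-to-scheduler idea, here is a minimal sketch using Python generators. Generators are not fibers (they do not get a separate machine stack, and they can only yield from their own frame), so treat this as an analogy for the control flow only; `worker` and `run_round_robin` are names invented for the example.

```python
from collections import deque
from typing import Generator, Iterable, List


def worker(name: str, steps: int, log: List[str]) -> Generator[None, None, None]:
    """A task that does one chunk of work, then voluntarily yields."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # hand control back to the scheduler; local state is preserved


def run_round_robin(tasks: Iterable[Generator[None, None, None]]) -> None:
    """Resume each task in turn until every task has finished."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)          # switch into the task until its next yield
        except StopIteration:
            continue            # task finished, drop it
        ready.append(task)      # still has work, put it at the back of the queue
```

Each call to `next(task)` resumes the task exactly where it left off, which is the "how do I resume from where I was" part of the question. A task that never yields would stall everything else, just as in the Mac OS 9 and Windows 3.x case.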
Also remember that cooperative multitasking is orthogonal to processes, protected memory, address spaces, etc. Witness Mac OS 9 or Windows 3.x. They supported the idea of separate processes. But when you yielded, the context was changed to the OS context, allowing the OS scheduler to run, which then potentially selected another process to switch to. In theory you could have a full protected virtual memory OS that still used cooperative multitasking. In those systems, if an errant process never yielded, the OS scheduler never ran, so all other processes in the system were frozen. **
The next natural question is what makes something pre-emptive... The answer is that the OS schedules an interrupt timer with the CPU to stop the currently executing task and switch back to the OS scheduler's context regardless of whether the current task cares to release the CPU or not, thus "pre-empting" it.
If the OS uses CPU privilege levels, the (kernel configured) timer is not cancelable by lower level (user mode) code, though in theory if the OS didn't use such protections an errant task could mask off or cancel the interrupt timer and hijack the CPU. There are some other scenarios like IO calls where the scheduler can be invoked outside the timer, and the scheduler may decide no other process has higher priority and return control to the same process without a switch... And in reality most OSes don't do a real context switch here because that's expensive, the scheduler code runs inside the context of whatever process was executing, so it has to be very careful not to step on the stack, to save register states, etc.
** You might ask why not just fire a timer if yield isn't called within a certain period of time. The answer lies in multi-threaded synchronization. In a cooperative system, you don't have to bother taking locks, worry about re-entrance, etc because you only yield when things are in a known good state. If this mythical timer fires, you have now potentially corrupted the state of the program that was interrupted. If programs have to be written to handle this, congrats... You now have a half-assed pre-emptive multitasking system. Might as well just do it right! And if you are changing things anyway, may as well add threads, protected memory, etc. That's pretty much the history of the major OSes right there.
The basic idea behind cooperative multitasking is trust - that each subtask will relinquish control, of its own accord, in a timely fashion, to avoid starving other tasks of processor time. This is why tasks in a cooperative multitasking system need to be tested extremely thoroughly, and in some cases certified for use.
I don't claim to be an expert, but I imagine cooperative tasks could be implemented as state machines, where passing control to the task would cause it to run for the absolute minimal amount of time it needs to make any kind of progress. For example, a file reader might read the next few bytes of a file, a parser might parse the next line of a document, or a sensor controller might take a single reading, before returning control back to a cooperative scheduler, which would check for task completion.
Each task would have to keep its internal state on the heap (at object level), rather than on the stack frame (at function level) like a conventional blocking function or thread.
And unlike conventional multitasking, which relies on a hardware timer to trigger a context switch, cooperative multitasking relies on the code to be written in such a way that each step of each long-running task is guaranteed to finish in an acceptably small amount of time.
The tasks will execute an explicit wait or pause or yield operation which makes the call to the dispatcher. There may be different operations for waiting on IO to complete or explicitly yielding in a heavy computation. In an application task's main loop, it could have a *wait_for_event* call instead of busy polling. This would suspend the task until it has input to process.
There may also be a time-out mechanism for catching runaway tasks, but it is not the primary means of switching (or else it wouldn't be cooperative).
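A wait-for-event style dispatcher might look roughly like the sketch below. It is an assumption-laden toy: the `Dispatcher` class, its `spawn` and `post` methods, and the convention that a task yields the name of the event it is waiting for are all invented here, and no time-out mechanism is modelled.

```python
from typing import Dict, Generator, List


def echo_task(log: List[str]) -> Generator[str, str, None]:
    """Waits for "input" events and records them; stops on "quit"."""
    while True:
        payload = yield "input"   # plays the role of wait_for_event("input")
        if payload == "quit":
            return
        log.append(payload.upper())


class Dispatcher:
    """Hypothetical dispatcher: tasks are suspended until their event is posted."""

    def __init__(self) -> None:
        self.waiting: Dict[str, List[Generator[str, str, None]]] = {}

    def spawn(self, task: Generator[str, str, None]) -> None:
        try:
            kind = next(task)     # run the task until its first wait
        except StopIteration:
            return
        self.waiting.setdefault(kind, []).append(task)

    def post(self, kind: str, payload: str) -> None:
        """Deliver an event to every task currently waiting for it."""
        for task in self.waiting.pop(kind, []):
            try:
                next_kind = task.send(payload)   # resume the task with the event
            except StopIteration:
                continue                         # task finished
            self.waiting.setdefault(next_kind, []).append(task)
```

A suspended task consumes no CPU at all between events, which is the difference from busy polling.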
One way to think of cooperative multitasking is to split a task into steps (or states). Each task keeps track of the next step it needs to execute. When it's the task's turn, it executes only that one step and returns. That way, in the main loop of your program you are simply calling each task in order, and because each task only takes up a small amount of time to complete a single step, we end up with a system which allows all of the tasks to share cpu time (ie. cooperate).
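The step-per-turn approach can be sketched as follows. The `CountdownTask` class and `main_loop` function are illustrative names only; the point is that each task keeps its progress in an object field rather than on the call stack, and `step()` always returns quickly.

```python
from typing import Iterable, List


class CountdownTask:
    """Keeps its own state between turns instead of relying on a stack frame."""

    def __init__(self, name: str, start: int, log: List[str]) -> None:
        self.name = name
        self.remaining = start   # the "next step" this task has to execute
        self.log = log

    @property
    def finished(self) -> bool:
        return self.remaining <= 0

    def step(self) -> None:
        """Do exactly one small unit of work, then return to the main loop."""
        self.log.append(f"{self.name}={self.remaining}")
        self.remaining -= 1


def main_loop(tasks: Iterable[CountdownTask]) -> None:
    """Call each unfinished task in order until all of them are done."""
    active = [t for t in tasks if not t.finished]
    while active:
        for task in active:
            task.step()
        active = [t for t in active if not t.finished]
```

Compared with coroutine-based designs, this needs no language or OS support at all, at the price of having to write every task as an explicit state machine.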

Limiting TCP sends with a "to-be-sent" queue and other design issues

This question is the result of two other questions I've asked in the last few days.
I'm creating a new question because I think it's related to the "next step" in my understanding of how to control the flow of my send/receive, something I didn't get a full answer to yet.
The other related questions are:
An IOCP documentation interpretation question - buffer ownership ambiguity
Non-blocking TCP buffer issues
In summary, I'm using Windows I/O Completion Ports.
I have several threads that process notifications from the completion port.
I believe the question is platform-independent and would have the same answer as if to do the same thing on a *nix, *BSD, Solaris system.
So, I need to have my own flow control system. Fine.
So I send send and send, a lot. How do I know when to start queueing the sends, as the receiver side is limited to X amount?
Let's take an example (closest thing to my question): FTP protocol.
I have two servers; One is on a 100Mb link and the other is on a 10Mb link.
I order the 100Mb one to send to the other one (the 10Mb linked one) a 1GB file. It finishes with an average transfer rate of 1.25MB/s.
How did the sender (the 100Mb-linked one) know when to hold off sending, so that the slower one wouldn't be flooded? (In this case the "to-be-sent" queue is the actual file on the hard disk.)
Another way to ask this:
Can I get a "hold-your-sendings" notification from the remote side? Is it built-in in TCP or the so called "reliable network protocol" needs me to do so?
I could of course limit my sendings to a fixed number of bytes but that simply doesn't sound right to me.
Again, I have a loop with many sends to a remote server, and at some point, within that loop I'll have to determine if I should queue that send or I can pass it on to the transport layer (TCP).
How do I do that? What would you do? Of course that when I get a completion notification from IOCP that the send was done I'll issue other pending sends, that's clear.
Another design question related to this:
Since I am going to use custom buffers with a send queue, and these buffers are released for reuse (rather than freed with the "delete" keyword) when a "send-done" notification arrives, I'll have to use mutual exclusion on that buffer pool.
Using a mutex slows things down, so I've been thinking: why not have each thread keep its own buffer pool? Accessing it, at least when getting the buffers required for a send operation, would then need no mutex, because the pool belongs to that thread only.
The buffer pool is located at the thread local storage (TLS) level.
No shared pool implies no lock needed, which implies faster operations BUT also more memory used by the app, because even if one thread has already allocated 1000 buffers, another thread that is sending right now and needs 1000 buffers will have to allocate its own.
Another issue:
Say I have buffers A, B, C in the "to-be-sent" queue.
Then I get a completion notification that tells me that the receiver got 10 out of 15 bytes. Should I re-send from the relative offset of the buffer, or will TCP handle it for me, i.e complete the sending? And if I should, can I be assured that this buffer is the "next-to-be-sent" one in the queue or could it be buffer B for example?
This is a long question and I hope none got hurt (:
I'd loveeee to see someone takes the time to answer here. I promise I'll double-vote for him! (:
Thank you all!
Firstly: I'd ask this as separate questions. You're more likely to get answers that way.
I've spoken about most of this on my blog: http://www.lenholgate.com but then since you've already emailed me to say that you read my blog you know that...
The TCP flow control issue is that you are posting asynchronous writes and each of these uses resources until it completes (see here). During the time that the write is pending there are various resource usage issues to be aware of, and the use of your data buffer is the least important of them; you'll also use up some non-paged pool, which is a finite resource (though there is much more available in Vista and later than in previous operating systems), and you'll also be locking pages in memory for the duration of the write, and there's a limit to the total number of pages that the OS can lock. Note that both the non-paged pool usage and page locking issues aren't something that's documented very well anywhere, but you'll start seeing writes fail with ENOBUFS once you hit them.
Due to these issues it's not wise to have an uncontrolled number of writes pending. If you are sending a large amount of data and you have no application-level flow control, then you need to be aware that if you send data faster than it can be processed by the other end of the connection, or faster than the link speed, you will begin to use up lots and lots of the above resources as your writes take longer to complete due to TCP flow control and windowing issues. You don't get these problems with blocking socket code, as the write calls simply block when the TCP stack can't write any more due to flow control issues; with async writes the write calls return immediately and the writes are then pending. With blocking code the blocking deals with your flow control for you; with async writes you could continue to loop and post more and more data, all of which is just waiting to be sent by the TCP stack...
Anyway, because of this, with async I/O on Windows you should ALWAYS have some form of explicit flow control. So, you either add application-level flow control to your protocol, using an ACK, perhaps, so that you know when the data has reached the other side and only allow a certain amount to be outstanding at any one time, OR, if you can't add to the application-level protocol, you can drive things by using your write completions. The trick is to allow a certain number of outstanding write completions per connection and to queue the data (or just not generate it) once you have reached your limit. Then as each write completes you can generate a new write...
Your question about pooling the data buffers is, IMHO, premature optimisation on your part right now. Get to the point where your system is working properly and you have profiled your system and found that the contention on your buffer pool is the most important hot spot and THEN address it. I found that per-thread buffer pools didn't work so well, as the distribution of allocations and frees across threads tends not to be as balanced as you'd need for that to work. I've spoken about this more on my blog: http://www.lenholgate.com/blog/2010/05/performance-comparisons-for-recent-code-changes.html
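The "limit the outstanding writes and queue the rest" idea can be sketched independently of IOCP. In the sketch below everything is an assumption made for illustration: `ThrottledConnection` is an invented class, `post_write` stands for whatever function actually starts an asynchronous write (for example a wrapper around an overlapped send), and `on_write_complete` is meant to be called from your completion handler.

```python
from collections import deque
from typing import Callable, Deque


class ThrottledConnection:
    """Allows at most `max_outstanding` writes in flight; queues the rest."""

    def __init__(self, post_write: Callable[[bytes], None], max_outstanding: int = 4) -> None:
        self._post_write = post_write          # hypothetical: starts one async write
        self.max_outstanding = max_outstanding
        self.outstanding = 0                   # writes posted but not yet completed
        self.pending: Deque[bytes] = deque()   # the "to-be-sent" queue

    def send(self, data: bytes) -> None:
        if self.outstanding < self.max_outstanding:
            self.outstanding += 1
            self._post_write(data)
        else:
            self.pending.append(data)          # over the limit, hold it back

    def on_write_complete(self) -> None:
        """Call once per write completion notification."""
        self.outstanding -= 1
        if self.pending:
            self.outstanding += 1
            self._post_write(self.pending.popleft())
```

This version is deliberately single-threaded. With several threads servicing the completion port, the counter and the queue would need per-connection locking, and a producer that can be paused (such as reading the next chunk of a file only when a slot frees up) avoids building a large queue in the first place.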
Your question about partial write completions (you send 100 bytes and the completion comes back and says that you have only sent 95) isn't really a problem in practice IMHO. If you get to this position and have more than the one outstanding write then there's nothing you can do, the subsequent writes may well work and you'll have bytes missing from what you expected to send; BUT a) I've never seen this happen unless you have already hit the resource problems that I detail above and b) there's nothing you can do if you have already posted more writes on that connection so simply abort the connection - note that this is why I always profile my networking systems on the hardware that they will run on and I tend to place limits in MY code to prevent the OS resource limits ever being reached (bad drivers on pre Vista operating systems often blue screen the box if they can't get non paged pool so you can bring a box down if you don't pay careful attention to these details).
Separate questions next time, please.
Q1. Most APIs will give you a "write is possible" event once writing becomes available again after your last write (this can happen immediately if your last send did not fill a major part of the send buffer).
With a completion port, it will arrive just like a "new data" event. Think of new data as "read OK", so there's also a "write OK" event. Names differ between the APIs.
Q2. If a kernel mode transition for mutex acquisition per chunk of data hurts you, I recommend rethinking what you are doing. It takes 3 microseconds at most, while your thread scheduler slice may be as big as 60 milliseconds on Windows.
It may hurt in extreme cases. If you think you are programming extreme communications, please ask again, and I promise to tell you all about it.
To address your question about when it knew to slow down, you seem to lack an understanding of TCP congestion mechanisms. "Slow start" is what you're talking about, but it's not quite how you've worded it. Slow start is exactly that -- starts off slow, and gets faster, up to as fast as the other end is willing to go, wire line speed, whatever.
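For intuition, the growth pattern can be sketched with an idealized model: no packet loss, window sizes counted in whole segments, one update per round trip. The function name and parameters below are invented for the example; the one firm fact it relies on is that a TCP sender is limited by the smaller of its congestion window and the window advertised by the receiver.

```python
from typing import List


def send_window_trace(rounds: int, ssthresh: int, rwnd: int, initial_cwnd: int = 1) -> List[int]:
    """Segments the sender may have in flight during each round trip.

    cwnd doubles per round trip during slow start, then grows by one
    segment per round trip; the usable window is min(cwnd, rwnd).
    """
    cwnd = initial_cwnd
    trace = []
    for _ in range(rounds):
        trace.append(min(cwnd, rwnd))         # receiver's window caps the sender
        if cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh)    # slow start: exponential growth
        else:
            cwnd += 1                         # congestion avoidance: linear growth
    return trace
```

For example, `send_window_trace(7, ssthresh=8, rwnd=10)` gives `[1, 2, 4, 8, 9, 10, 10]`: the rate ramps up quickly and then flattens at what the receiver is willing to accept, without the application doing anything.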
With respect to the rest of your question, Pavel's answer should suffice.

What exactly are "spin-locks"?

I always wondered what they are: every time I hear about them, images of futuristic flywheel-like devices go dancing (rolling?) through my mind...
What are they?
When you use regular locks (mutexes, critical sections etc.), the operating system puts your thread in the WAIT state and preempts it by scheduling other threads on the same core. This has a performance penalty if the wait time is really short, because your thread now has to wait for a preemption to receive CPU time again.
Besides, kernel objects are not available in every state of the kernel, such as in an interrupt handler or when paging is not available etc.
Spinlocks don't cause preemption but wait in a loop ("spin") till the other core releases the lock. This prevents the thread from losing its quantum and lets it continue as soon as the lock gets released. The simple mechanism of spinlocks allows a kernel to utilize them in almost any state.
That's why on a single core machine a spinlock is simply a "disable interrupts" or "raise IRQL" which prevents thread scheduling completely.
Spinlocks ultimately allow kernels to avoid "Big Kernel Lock"s (a lock acquired when core enters kernel and released at the exit) and have granular locking over kernel primitives, causing better multi-processing on multi-core machines thus better performance.
EDIT: A question came up: "Does that mean I should use spinlocks wherever possible?" and I'll try to answer it:
As I mentioned, Spinlocks are only useful in places where anticipated waiting time is shorter than a quantum (read: milliseconds) and preemption doesn't make much sense (e.g. kernel objects aren't available).
If waiting time is unknown, or if you're in user mode Spinlocks aren't efficient. You consume 100% CPU time on the waiting core while checking if a spinlock is available. You prevent other threads from running on that core till your quantum expires. This scenario is only feasible for short bursts at kernel level and unlikely an option for a user-mode application.
Here is a question on SO addressing that: Spinlocks, How Useful Are They?
Say a resource is protected by a lock; a thread that wants access to the resource needs to acquire the lock first. If the lock is not available, the thread might repeatedly check whether the lock has been freed. During this time the thread busy-waits, checking for the lock, using CPU, but not doing any useful work. Such a lock is termed a spin lock.
It is pretty much a loop that keeps going till a certain condition is met:
while(cantGoOn) {};
while(something != TRUE ){};
// it happened
move_on();
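A plain `while` loop on an ordinary variable is not enough on its own, because checking the flag and taking the lock have to happen as one atomic step. The sketch below shows the shape of a real spinlock in Python. It is illustrative only: Python has no user-level atomic test-and-set, so a non-blocking `threading.Lock.acquire` is used as a stand-in for that instruction, and the `SpinLock` class name is made up.

```python
import threading


class SpinLock:
    """Busy-waiting lock; the inner Lock is used purely as an atomic flag."""

    def __init__(self) -> None:
        self._flag = threading.Lock()

    def acquire(self) -> None:
        # Atomically "test and set" the flag; keep retrying until it succeeds.
        while not self._flag.acquire(blocking=False):
            pass  # spin: burn CPU instead of sleeping

    def release(self) -> None:
        self._flag.release()

    def __enter__(self) -> "SpinLock":
        self.acquire()
        return self

    def __exit__(self, *exc_info) -> None:
        self.release()
```

Usage is the same as for any other lock (`with spin: ...`), but a waiting thread stays runnable and consumes CPU the whole time, which is why this only pays off when the lock is held very briefly.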
It's a type of lock that does busy waiting
It's considered an anti-pattern, except for very low-level driver programming (where it can happen that calling a "proper" waiting function has more overhead than simply busy locking for a few cycles).
See for example Spinlocks in Linux kernel.
Spinlocks are locks in which the thread actively waits until the lock is available. They are normally used to avoid the overhead of obtaining a kernel object when there is a good chance of acquiring it within a short time.
Ex:
While(SpinCount-- && Kernel Object is not free)
{}
try acquiring Kernel object
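The spin-count pattern above, spinning for a bounded number of attempts and only then falling back to a blocking wait, can be sketched like this. The helper name `acquire_with_spin` and the default spin count are arbitrary choices for the example; `threading.Lock` stands in for the kernel object.

```python
import threading


def acquire_with_spin(lock: threading.Lock, spin_count: int = 1000) -> bool:
    """Acquire `lock`, spinning first.

    Returns True if the lock was obtained during the spin phase,
    False if the caller had to fall back to a blocking wait.
    """
    for _ in range(spin_count):
        if lock.acquire(blocking=False):   # cheap attempt, never sleeps
            return True
    lock.acquire()                         # give up spinning and block
    return False
```

Either way the caller owns the lock on return and must release it. The return value only reports which path was taken, which can be useful when tuning the spin count.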
You would want to use a spinlock when you think it is cheaper to enter a busy-waiting loop and poll a resource instead of blocking when the resource is locked.
Spinning can be beneficial when locks are fine grained and large in number (for example, a lock per node in a linked list) as well as when lock hold times are always extremely short. In general, while holding a spin lock, one should avoid blocking, calling anything that itself may block, holding more than one spin lock at once, making dynamically dispatched calls (interface and virtuals), making statically dispatched calls into any code one doesn't own, or allocating memory.
It's also important to note that SpinLock is a value type, for performance reasons. As such, one must be very careful not to accidentally copy a SpinLock instance, as the two instances (the original and the copy) would then be completely independent of one another, which would likely lead to erroneous behavior of the application. If a SpinLock instance must be passed around, it should be passed by reference rather than by value.
It's a loop that spins around until a condition is met.
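In a nutshell, a spinlock uses atomic compare-and-swap (CAS) or test-and-set style instructions to implement a thread-safe lock without going through the kernel. (Strictly speaking it is not lock-free or wait-free, since waiters make no progress while another thread holds the lock.) Such structures scale well on multi-core machines when hold times are short.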
Well, yes - the point of spin locks (vs a traditional critical sections, etc) is that they offer better performance under some circumstances (multicore systems..), because they don't immediately yield the rest of the thread's quantum.
A spinlock is a type of lock whose waiters neither block nor sleep. Any thread that wants to acquire a spinlock for a shared or critical resource will spin continuously, wasting CPU cycles, until it acquires the lock for that resource. Once the spinlock is acquired, the thread tries to complete its work within its quantum and then releases the resource. A spinlock is the highest-priority type of lock; put simply, it is a non-preemptive kind of lock.

process scheduling question

For example, a process waiting for disk I/O to complete will sleep on the address of the buffer header corresponding to the data being transferred. When the interrupt routine for the disk driver notes that the transfer is complete, it calls wakeup on the buffer header. The interrupt uses the kernel stack for whatever process happened to be running at the time, and the wakeup is done from that system process.
Can you please explain the last line in the paragraph which I have emphasised. It is about waking up the process which has been waiting for some event to occur and thus has slept. This para is from Galvin. By the way can you suggest some good book or link for studying unix operating systems?
Thanks.
There is some process running at the time the interrupt is received. The kernel doesn't change over to some other process context to handle it -- that would take time -- it just does what's necessary in the current context, and lets the scheduler know that the next time it schedules, the waiting process is ready to proceed.
There are a number of good internals books around. I'm fond of the various McKusick et al books, like The Design and Implementation of the FreeBSD Operating System.
Maurice Bach's Design of the Unix Operating System is the most well-known and comprehensive book on the subject.
The I/O completion interrupt will be executed as soon as the disk signals the end of the transfer. This is done regardless of what the kernel is currently doing. Interrupt handlers are usually very small and self-contained. Therefore it is faster to re-use the current runtime environment (stack, CPU state, etc.) instead of doing a full context switch to a separate thread. On the down side this means that interrupt handlers are only allowed to do very limited things, like setting a flag somewhere else, or enqueuing a work item. Also, they have to clean up very carefully after themselves, so that the running process is not disturbed.
Eric Raymond's 'The Art of Unix Programming' should be read to understand the Unix philosophy and culture, and to actually know and appreciate the reasons behind its design.
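The sleep/wakeup mechanism from the quoted passage can be modelled with a small toy, shown below. Nothing here corresponds to real kernel code: `ToyKernel`, its method names, and the use of strings for processes and channels are all assumptions made to show the control flow, namely that the interrupt handler only marks the sleeper runnable and does not switch to it.

```python
from collections import defaultdict
from typing import DefaultDict, List, Optional


class ToyKernel:
    """Hypothetical model of sleeping on a channel and being woken by an interrupt."""

    def __init__(self) -> None:
        self.sleeping: DefaultDict[str, List[str]] = defaultdict(list)  # channel -> processes
        self.runnable: List[str] = []
        self.current: Optional[str] = None   # process whose context is on the CPU

    def sleep(self, proc: str, channel: str) -> None:
        """Block `proc` until someone calls wakeup on `channel`."""
        self.sleeping[channel].append(proc)

    def wakeup(self, channel: str) -> None:
        """Mark every process sleeping on `channel` as runnable."""
        self.runnable.extend(self.sleeping.pop(channel, []))

    def disk_interrupt(self, channel: str) -> Optional[str]:
        """Runs in the context of whatever process happens to be current."""
        self.wakeup(channel)      # no context switch here, just bookkeeping
        return self.current       # the interrupted process carries on afterwards

    def schedule(self) -> Optional[str]:
        """Later, the scheduler picks the next runnable process."""
        if self.runnable:
            next_proc = self.runnable.pop(0)
            if self.current is not None:
                self.runnable.append(self.current)
            self.current = next_proc
        return self.current
```

In this model, process B is running when the disk interrupt for process A's buffer arrives. The handler borrows B's context, moves A to the runnable list, and returns to B. A only gets the CPU the next time `schedule` runs.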

Resources